Jonathan Eisen @phylogenomics talk for #LAMG12
Transcript of Jonathan Eisen @phylogenomics talk for #LAMG12
Phylogenomic Approaches to the Study of Microbial Diversity
September 16, 2012
Lake Arrowhead Microbial Genomes
#LAMG12
Jonathan A. Eisen
University of California, Davis
@phylogenomics
Sunday, September 16, 12
A Bit of History
• For the real story about the Lake Arrowhead Microbial Genomes meetings see http://tinyurl.com/LAMG12
• But the key to LAMG meetings is ...
Quotes
• Space-time continuum of genes and genomes
• Microbes not only have a lot of sex, they have a lot of weird sex
• Gene sequences are the wormhole that allows one to tunnel into the past
• This is how you do metagenomics on 50 dollars, and that’s Canadian dollars
• The human guts are a real milieu of stuff
• Antibiotics do not kill things, they corrupt them
Quotes
• There comes a point in life when you have to bring chemists into the picture
• The rectal swabs are here in tan color
• If I have time I will tell you about a dream
• "Another thing you need to know" (pause) "Actually you don't NEED to know any of this"
• I have been influenced by Fisher Price throughout my life
• This is going to be ironic coming from someone who studies circumcision
Quotes
• And we will bring out the unused cheese from yesterday
• A paper came out next year
• It takes 1000 nanobiologists to make one microbiologist
• In an engineering sense, the vagina is a simple plug flow reactor
Phylogenomic Approaches to Studying Microbial Diversity
Example 1: Phylotyping and Phylogenetic Diversity
[Figure: rRNA phylotyping workflow. DNA extraction → PCR (makes lots of copies of the rRNA genes in the sample) → sequence rRNA genes → sequence alignment = data matrix]
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’...TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
[Data matrix: aligned columns for rRNA1 to rRNA4 plus the reference sequences E. coli, Humans, and Yeast. Rows legible in the transcript:]
rRNA3 C A C T G T
rRNA4 C A C A G T
Yeast T A C A G T
rRNA Phylotyping
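The step from aligned rRNA sequences to a data matrix can be illustrated with a small sketch. It uses the three alignment rows that are legible on the slide (rRNA3, rRNA4, Yeast) and computes pairwise identity over the aligned columns. The function name `identity` and the choice of a simple match fraction are assumptions made for illustration, not something stated in the talk.

```python
from itertools import combinations

# Aligned columns copied from three rows of the slide's data matrix.
alignment = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def identity(a: str, b: str) -> float:
    """Fraction of aligned columns where two sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Pairwise identity for every pair of rows in the alignment.
pairwise = {
    (p, q): identity(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}

for (p, q), value in pairwise.items():
    print(f"{p} vs {q}: {value:.2f}")
```

A matrix of this kind is the usual starting point for both clustering and tree building; real pipelines work on full-length alignments and often use model-corrected distances rather than a raw match fraction.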
[Figure, shown as a series of builds: a phylogenetic tree relating the environmental sequences (OTU1, OTU2, OTU3, OTU4) to the reference sequences E. coli, Humans, and Yeast, alongside a plot with labels A, B, and C. Panel labels: "Cluster", "OTUs", "Just Phylogeny".]
rRNA Phylotyping
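The "Cluster" to "OTUs" step on these slides can be sketched as a greedy clustering of aligned reads by sequence identity. This is a simplified illustration, not the method used in the talk: the read names, the toy sequences, the 0.9 threshold, and the function `greedy_otus` are all hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length reads agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: dict, threshold: float) -> list:
    """Assign each read to the first OTU whose representative it matches.

    The first read placed in an OTU serves as its representative. A read
    that matches no existing representative starts a new OTU.
    """
    otus = []  # list of (representative sequence, member names)
    for name, seq in seqs.items():
        for rep, members in otus:
            if identity(seq, rep) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


# Hypothetical aligned reads, 10 columns each.
reads = {
    "read1": "ACGTACGTAC",
    "read2": "ACGTACGTAT",
    "read3": "TTGTACCTAC",
    "read4": "TTGTACCTAA",
}

otus = greedy_otus(reads, threshold=0.9)
print(otus)
```

Greedy assignment depends on input order; tools used in practice typically sort reads (for example by abundance or length) before clustering to make the result more stable.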
• OTUs
• Taxonomic lists
• Relative abundance of taxa
• Ecological metrics (alpha and beta diversity)
• Phylogenetic metrics
• Binning
• Identification of novel groups
• Clades
• Rates of change
• LGT
• Convergence
• PD
• Phylogenetic ecology (e.g., UniFrac)
rRNA Phylotyping
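Two of the OTU-based summaries listed on this slide, relative abundance of taxa and alpha diversity, can be sketched from a table of per-OTU read counts. The counts below are hypothetical, and the Shannon index is chosen here only as one common alpha-diversity metric; the slide does not say which metric was used.

```python
import math


def relative_abundance(counts: dict) -> dict:
    """Convert raw per-OTU read counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}


def shannon(counts: dict) -> float:
    """Shannon index H = -sum(p * ln p) over OTUs with non-zero abundance."""
    props = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in props if p > 0)


# Hypothetical read counts for one sample.
counts = {"OTU1": 50, "OTU2": 30, "OTU3": 15, "OTU4": 5}

for otu, p in relative_abundance(counts).items():
    print(f"{otu}: {p:.2f}")
print(f"Shannon H = {shannon(counts):.3f}")
```

For a sample with four equally abundant OTUs the index equals ln 4, its maximum for four taxa; uneven samples score lower.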
What’s New in Phylotyping
What’s New in Phylotyping I
• More PCR products
• Deeper sequencing
• The rare biosphere
• Relative abundance estimates
• More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling
Beta-Diversity
a broader range of Proteobacteria, but yielded similar results(Fig. S1 and Tables S2 and S3).Across all samples, we identified 4,931 quality Nitrosomadales
sequences, which grouped into 176 OTUs (operational taxo-nomic units) using an arbitrary 99% sequence similarity cutoff.This cutoff retained a high amount of sequence diversity, butminimized the chance of including diversity because of se-quencing or PCR errors. Most (95%) of the sequences appearclosely related either to the marine Nitrosospira-like clade,known to be abundant in estuarine sediments (e.g., ref. 19) or tomarine bacterium C-17, classified as Nitrosomonas (20) (Fig. S2).Pairwise community similarity between the samples was calcu-lated based on the presence or absence of each OTU usinga rarefied Sørensen’s index (4). Community similarity using thisincidence index was highly correlated with the abundance-basedSørensen index (Mantel test: ! = 0.9239; P = 0.0001) (21).A plot of community similarity versus geographic distance for
each pairwise set of samples revealed that the Nitrosomonadalesdisplay a significant, negative distance-decay curve (slope = !0.08,P < 0.0001) (Fig. 2). Furthermore, the slope of this curve variedsignificantly among the three spatial scales. The distance-decayslope within marshes was significantly shallower than the overallslope (slope=!0.04;P< 0.0334) and steeper acrossmarsheswithina region than the overall slope (slope= !0.27, P < 0.0007) (Fig. 2).In contrast, at the continental scale, the distance-decay curve didnot differ from zero (P = 0.0953). Thus, there is no evidence thatsampling across continents contributed Nitromonadales OTU di-versity in addition to what was already observed at the marsh andregional scales. Furthermore, additional analyses suggest that theseresults are not driven by a few outlier samples (Fig. S3).Over all spatial scales, both the environment and dispersal lim-
itation appear to influence Nitrosomonadales "-diversity. Rankedpartial Mantel tests revealed that the similarity in Nitrosomo-nadales community composition between samples was highly cor-related with environmental distance (!=!0.5339; P=0.0001) andgeographic distance (! = !0.2803; P = 0.0001), but not plantcommunity similarity (P = 0.72) (Table S2).To further identify the relative importance of factors con-
tributing to these correlations, we used a multiple regression onmatrices (MRM). The partial regression coefficients of an MRMmodel give a measure of the rate of change in community sim-ilarity per standardized unit of similarity for the variable of in-terest; all other explanatory variables are held constant (22).Over all scales, the MRMmodel explained a large and significantproportion (R2 = 46%; P < 0.0001) of the variability in Nitro-
somonadales community similarity. Geographic distance contributed the largest partial regression coefficient (b = 0.40, P < 0.0001), with sediment moisture, nitrate concentration, plant cover, salinity, and air and water temperature contributing smaller, but significant, partial regression coefficients (b = 0.09–0.17, P < 0.05) (Table 1). Because salt marsh bacteria may be dispersing through ocean currents, we also used a global ocean circulation model (23), as applied previously (24), to estimate relative dispersal times of hypothetical microbial cells between each sampling location. Dispersal times between sampling points did not explain more variability in bacterial community similarity (ln dispersal time: b = 0.06, P = 0.0799; with dispersal R² = 0.47 vs. without 0.46). Therefore, in the remaining analyses we use geographic distance rather than dispersal time.
As hypothesized, the relative importance of environmental factors versus geographic distance to Nitrosomonadales community similarity differed across the three spatial scales. Contrary to our expectations, however, geographic distance had a strong effect on community similarity within salt marshes (partial regression coefficient b = 0.47) but no effect at larger scales (Table 1). Furthermore, the relative importance of different environmental variables varied by scale. Sediment moisture, which is likely related to unmeasured variables, such as oxygen availability, was the most important variable explaining community similarity within marshes (b = 0.63). In contrast, water temperature (b = 0.45) and nitrate concentrations (b = 0.17) were more important at the regional and continental scales, respectively.
The varying importance of the environmental parameters at different spatial scales likely reflects differences in their underlying variability at these scales. For example, the MRM model did exceptionally well in explaining variation in Nitrosomonadales community similarity at the regional scale (R² = 0.61) (Table 1). Notably, this spatial scale captures a latitudinal gradient on the east and west coasts of North America, which results in high variability in water temperature. Previous studies in the field and laboratory support the idea that AOB composition is particularly sensitive to temperature (e.g., refs. 25 and 26). Within marshes,
Fig. 1. The 13 marshes sampled (see Table S1 for details). Marshes compared with one another within regions are circled. (Inset) The arrangement of sampling points within marshes. Six points were sampled along a 100-m transect, and a seventh point was sampled ~1 km away. Two marshes in the Northeast United States (outlined stars) were sampled more intensively, along four 100-m transects in a grid pattern.
Fig. 2. Distance-decay curves for the Nitrosomonadales communities. The dashed, blue line denotes the least-squares linear regression across all spatial scales. The solid lines denote separate regressions within each of the three spatial scales: within marshes, regional (across marshes within regions circled in Fig. 1), and continental (across regions). The slopes of all lines (except the solid light blue line) are significantly less than zero. The slopes of the solid red lines are significantly different from the slope of the all-scale (blue dashed) line.
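The least-squares lines described in the Fig. 2 caption can be sketched as a plain regression of community similarity on geographic distance. The Python below is an illustration only: the function name `ols_slope`, the log transform of both axes, and the five sample pairs are assumptions made for this example, not data or code from the paper, and the sketch does not reproduce the paper's significance tests.

```python
import math

def ols_slope(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical pairwise comparisons: (distance in meters, community similarity).
pairs = [(1, 0.60), (10, 0.52), (100, 0.47), (1_000, 0.40), (10_000, 0.36)]
log_dist = [math.log(d) for d, _ in pairs]   # assumed transform, for illustration
log_sim = [math.log(s) for _, s in pairs]

slope, intercept = ols_slope(log_dist, log_sim)
print(f"distance-decay slope: {slope:.3f}")  # negative when similarity falls with distance
```

Fitting the same function to the subset of pairs within each spatial scale would give one slope per scale, which is the comparison the caption describes.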
Martiny et al. PNAS | May 10, 2011 | vol. 108 | no. 19 | 7851
Drivers of bacterial β-diversity depend on spatial scale
Jennifer B. H. Martiny (a,1), Jonathan A. Eisen (b), Kevin Penn (c), Steven D. Allison (a,d), and M. Claire Horner-Devine (e)
(a) Department of Ecology and Evolutionary Biology, and (d) Department of Earth System Science, University of California, Irvine, CA 92697; (b) Department of Evolution and Ecology, University of California Davis Genome Center, Davis, CA 95616; (c) Center for Marine Biotechnology and Biomedicine, The Scripps Institution of Oceanography, University of California at San Diego, La Jolla, CA 92093; and (e) School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98195
Edited by Edward F. DeLong, Massachusetts Institute of Technology, Cambridge, MA, and approved March 31, 2011 (received for review November 1, 2010)
The factors driving β-diversity (variation in community composition) yield insights into the maintenance of biodiversity on the planet. Here we tested whether the mechanisms that underlie bacterial β-diversity vary over centimeters to continental spatial scales by comparing the composition of ammonia-oxidizing bacteria communities in salt marsh sediments. As observed in studies of macroorganisms, the drivers of salt marsh bacterial β-diversity depend on spatial scale. In contrast to macroorganism studies, however, we found no evidence of evolutionary diversification of ammonia-oxidizing bacteria taxa at the continental scale, despite an overall relationship between geographic distance and community similarity. Our data are consistent with the idea that dispersal limitation at local scales can contribute to β-diversity, even though the 16S rRNA genes of the relatively common taxa are globally distributed. These results highlight the importance of considering multiple spatial scales for understanding microbial biogeography.
microbial composition | distance-decay | Nitrosomonadales | ecological drift
Biodiversity supports the ecosystem processes upon which society depends (1). Understanding the mechanisms that generate and maintain biodiversity is thus key to predicting ecosystem responses to future environmental changes. The decrease in community similarity with geographic distance is a universal biogeographic pattern observed in communities from all domains of life (as in refs. 2–4). Pinpointing the underlying causes of this “distance-decay” pattern continues to be an area of intense research (5–9), as such studies of β-diversity (variation in community composition) yield insights into the maintenance of biodiversity. These studies are still relatively rare for microorganisms, however, and thus our understanding of the mechanisms underlying microbial diversity (most of the tree of life) remains limited.
β-Diversity, and therefore distance-decay patterns, could be driven solely by differences in environmental conditions across space, a hypothesis summed up by microbiologists as, “everything is everywhere, the environment selects” (10). Under this model, a distance-decay curve is observed because environmental variables tend to be spatially autocorrelated, and organisms with differing niche preferences are selected from the available pool of taxa as the environment changes with distance.
Dispersal limitation can also give rise to β-diversity, as it permits historical contingencies to influence present-day biogeographic patterns. For example, neutral niche models, in which an organism’s abundance is not influenced by its environmental preferences, predict a distance-decay curve (8, 11). On relatively short time scales, stochastic births and deaths contribute to a heterogeneous distribution of taxa (ecological drift). On longer time scales, stochastic genetic processes allow for taxon diversification across the landscape (evolutionary drift). If dispersal is limiting, then current environmental or biotic conditions will not fully explain the distance-decay curve, and thus geographic distance will be correlated with community similarity even after controlling for other factors (2).
For macroorganisms, the relative contribution of environmental factors or dispersal limitation to β-diversity depends on spatial scale (12). Fifty years ago, Preston (13) noted that the turnover rate (rate of change) of bird species composition across space within a continent is lower than that across continents. He attributed the high turnover rate across continents to evolutionary diversification (i.e., speciation) between faunas as a result of dispersal limitation and the lower turnover rates of bird species within continents as a result of environmental variation.
Here we investigate whether the mechanisms underlying β-diversity in bacteria also vary by spatial scale. We chose to focus on the ammonia-oxidizing bacteria (AOB), which, along with the ammonia-oxidizing archaea (14), perform the rate-limiting step of nitrification and thus play a key role in nitrogen dynamics. We compared AOB community composition in 106 sediment samples from 12 salt marshes on three continents. A partially nested sampling design achieved a relatively balanced distribution of pairwise distance classes over nine orders of magnitude, from 3 cm to 12,500 km (Fig. 1 and Table S1). We limited our sampling to a monophyletic group of bacteria, the AOB within the β-Proteobacteria, and one habitat, salt marshes primarily dominated by cordgrass (Spartina spp.). This approach constrained the pool of total diversity (richness) and kept the environmental and plant variation relatively constant, increasing our ability to identify whether dispersal limitation influences AOB composition.
We then asked two questions: (i) Does bacterial β-diversity, specifically the slope of the distance-decay curve, vary over local (within marsh), regional (across marshes within a coast), and continental scales? (ii) Do the underlying factors (environmental variation or dispersal limitation) explaining this diversity vary by spatial scale? Because most bacteria are small, abundant, and hardy, we predicted that dispersal limitation would occur primarily across continents, resulting in genetically divergent microbial “provinces” (15). At the same time, we predicted that environmental factors would contribute equally to distance-decay at all scales, resulting in the steepest slope at the continental scale as reported in plant and animal communities (12, 13, 16).
Results and Discussion
We characterized AOB community composition by cloning and Sanger sequencing of 16S rRNA gene regions targeted by two primer sets. Here we focus on the results from a subset of those sequences from the order Nitrosomonadales, generated using primers specific for AOB within the β-Proteobacteria class (17). The second primer set (18) generated longer sequences from
Author contributions: J.B.H.M. and M.C.H.-D. designed research; J.B.H.M., J.A.E., K.P., and M.C.H.-D. performed research; J.B.H.M., S.D.A., and M.C.H.-D. analyzed data; and J.B.H.M. and M.C.H.-D. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. HQ271472–HQ276885 and HQ276886–HQ283075).
(1) To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1016308108/-/DCSupplemental.
7850–7854 | PNAS | May 10, 2011 | vol. 108 | no. 19 www.pnas.org/cgi/doi/10.1073/pnas.1016308108
Earth Microbiome Project
Microbial Range Maps
Things You Could Do
• Mississippi River: 2320 miles long
• 1 site / mile
• 3 samples / site
• 6960 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
  • 30 million sequence reads
  • 4310 sequences / sample
• HiSeq 2000:
  • 6 billion sequence reads
  • 862,068 sequences / sample
Things You Could Do
• Mississippi River: 12,249,600 feet long
• 1 site / 500 feet
• 3 samples / site
• 73,497 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
  • 30 million sequence reads
  • 408 sequences / sample
• HiSeq 2000:
  • 6 billion sequence reads
  • 81,635 sequences / sample
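The arithmetic behind both versions of this slide can be checked with a short Python sketch. The function name `sampling_plan` is hypothetical, and rounding every quantity down to a whole number is an assumption that happens to match the figures on the slides; the river length, site spacing, samples per site, and reads per run are taken from the bullets above.

```python
FEET_PER_MILE = 5280

def sampling_plan(river_miles, feet_between_sites, samples_per_site, reads_per_run):
    """Return (total samples, reads per sample), both rounded down to whole numbers."""
    river_feet = river_miles * FEET_PER_MILE
    # Integer arithmetic avoids floating-point error; fractional sites are kept
    # until the final floor (an assumption that reproduces the slide figures).
    samples = river_feet * samples_per_site // feet_between_sites
    return samples, reads_per_run // samples

# Reads per run as stated on the slides.
RUNS = {"MiSeq": 30_000_000, "HiSeq 2000": 6_000_000_000}

for spacing in (FEET_PER_MILE, 500):  # one site per mile, then one site per 500 feet
    for name, reads in RUNS.items():
        samples, per_sample = sampling_plan(2320, spacing, 3, reads)
        print(f"1 site / {spacing} ft, {name}: {samples} samples, {per_sample} reads each")
```

Changing the spacing or the samples per site shows how quickly per-sample depth falls as sampling density rises, which is the trade-off the two slides contrast.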
What’s New in Phylotyping II
• Metagenomics avoids biases of rRNA PCR
shotgun sequence
Metagenomic Phylotyping
[Figure: diagram with point groups labeled A, B, and C; panel labels "Cluster", "OTUs" (OTU1–OTU4), and "Just Phylogeny"; reference taxa E. coli, Humans, and Yeast]
Phylogenetic Challenge
??
Phylogenetic Challenge
Multiple approaches
Method 1: Each is an island
• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time
STAP
STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depend heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which “positional homology” cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (S_r) according to the IUB matrix [42]. The consensus “nucleotide” for each column (X) also has R dimensions, with the score for each dimension (X_r) calculated as the average of the scores for that column in that dimension (average of S_r). Thus the score of the consensus nucleotide is a mathematical expression describing the average “nucleotide” in that column for that alignment.
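The consensus calculation described above can be sketched in a few lines of Python. This is an illustration only, not the STAP implementation: it assumes R = 4 dimensions (A, C, G, T), the `MATCH` and `MISMATCH` values are a simplified stand-in for the IUB matrix, skipping gap characters is a choice made for this example, and the names `residue_scores` and `column_consensus` are hypothetical.

```python
NUCLEOTIDES = "ACGT"          # assumed R = 4 dimensions, one per unambiguous nucleotide
MATCH, MISMATCH = 1.9, 0.0    # assumed stand-in for the IUB scoring matrix

def residue_scores(base):
    """Score vector S for one residue: one entry S_r per dimension r."""
    return [MATCH if base == ref else MISMATCH for ref in NUCLEOTIDES]

def column_consensus(column):
    """Consensus X for a column: X_r is the average of S_r over its residues."""
    # Gaps and ambiguity codes are skipped here (an assumption for this sketch).
    scores = [residue_scores(b) for b in column if b in NUCLEOTIDES]
    if not scores:
        return [0.0] * len(NUCLEOTIDES)
    return [sum(s[r] for s in scores) / len(scores) for r in range(len(NUCLEOTIDES))]

# Hypothetical aligned sequences, one string per row of the alignment.
alignment = ["ACGT", "ACGA", "AC-T", "ATGT"]
for index, column in enumerate(zip(*alignment)):
    print(index, column_consensus(column))
```

A fully conserved column yields a consensus concentrated in one dimension, while a mixed column spreads its weight across several, which is the property a column score can then build on.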
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002
Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
ss-rRNA Taxonomy Pipeline
PLoS ONE | www.plosone.org 5 July 2008 | Volume 3 | Issue 7 | e2566
Wu et al. 2008 PLoS One
STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which "positional homology" cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus "nucleotide" for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average "nucleotide" in that column for that alignment.
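The per-column consensus described above can be sketched in a few lines of Python. This is a minimal illustration, not the STAP code: it assumes R = 4 character states (A, C, G, T), an IUB-style matrix that scores 1.9 for a match and 0 for a mismatch, and that gaps are simply skipped. The excerpt stops before saying how the consensus is turned into a mask, so the spread measure (mean distance of each sequence from the consensus) and the cutoff in mask_columns are hypothetical choices made only for this example.

```python
import math

BASES = "ACGT"        # assumption: R = 4 character states
MATCH_SCORE = 1.9     # assumption: IUB-style match score, mismatches score 0


def base_scores(base):
    """Score vector (Sr) of one nucleotide in the R-dimensional space."""
    return [MATCH_SCORE if base == r else 0.0 for r in BASES]


def consensus_scores(column):
    """Consensus vector (Xr): the average of Sr over the column, gaps skipped."""
    rows = [base_scores(b) for b in column if b in BASES]
    if not rows:
        return [0.0] * len(BASES)
    return [sum(row[r] for row in rows) / len(rows) for r in range(len(BASES))]


def column_spread(column):
    """Hypothetical column score: mean distance of each sequence from X."""
    rows = [base_scores(b) for b in column if b in BASES]
    if not rows:
        return 0.0
    consensus = consensus_scores(column)
    return sum(math.dist(row, consensus) for row in rows) / len(rows)


def mask_columns(alignment, cutoff):
    """Indices of columns whose spread is at or below a hypothetical cutoff."""
    return [i for i, col in enumerate(zip(*alignment))
            if column_spread(col) <= cutoff]


alignment = ["ACGA", "ACTC", "ACAG", "ACCT"]
kept = mask_columns(alignment, cutoff=1.0)   # conserved columns survive
```

In this toy alignment the first two columns are perfectly conserved and are kept, while the last two contain all four bases and are masked.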
Each sequence analyzed separately
Sunday, September 16, 12
AMPHORA
Guide tree
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
Sunday, September 16, 12
Phylotyping w/ Proteins
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
Sunday, September 16, 12
Method 2: Most in the Family
Sunday, September 16, 12
Phylogenetic Challenge
??
[Figure: four sequence fragments of differing lengths, drawn as rows of x]
Sunday, September 16, 12
Method 2: Most in family
One tree for those w/ overlap
[Figure: four sequence fragments of differing lengths, drawn as rows of x]
Sunday, September 16, 12
Venter et al., Science 304: 66. 2004
rRNA in Sargasso Metagenome
Sunday, September 16, 12
RecA Phylotyping in Sargasso Data
Venter et al., Science 304: 66. 2004
Sunday, September 16, 12
Sargasso Phylotypes
[Figure: bar chart. Y-axis: Weighted % of Clones (0 to 0.500). X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Sargasso Phylotyping
Venter et al., Science 304: 66. 2004
Sunday, September 16, 12
STAP, QIIME, Mothur
Combine all into one alignment
Sunday, September 16, 12
WATERS
Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317
sequence rDNA (the genes for ribosomal RNA) in particular those for small-subunit ribosomal RNA (ss-rRNA). These studies revealed a large amount of previously undetected microbial diversity [1,11-13]. Researchers focused on the small subunit rRNA gene not only because of the ease with which it can be PCR amplified, but also because it has variable and highly conserved regions, it is thought to be universally distributed among all living organisms, and it is useful for inferring phylogenetic relationships [14,15]. Since then, "cultivation-independent technologies" have brought a revolution to the field of microbiology by allowing scientists to study a wide and complex amount of diversity in many different habitats and environments [16-18]. The general premise of these methods remains relatively unchanged from the initial experiments two decades ago and relies on straightforward molecular biology techniques and bioinformatics tools from ecology, evolutionary biology and DNA sequencing projects.
Briefly, the lab work involved in 16S rDNA surveys begins with environmental samples (e.g., soil or water) from which total genomic DNA is extracted. Next, the 16S rDNA is PCR-amplified with pan-bacterial or pan-archaeal primers (i.e., primers designed to amplify as many known bacteria or archaea as possible), cloned into a sequencing vector, and then sequenced (or directly sequenced without cloning in next generation sequencing) resulting in large collections of diverse microbial 16S rDNA sequences from these different samples. As sequencing costs have continually declined, environmental microbiology surveys have expanded correspondingly and 16S rDNA datasets have grown increasingly complex.
The size and complexity of data sets introduce a new challenge: analyses that one could carry out manually on small data sets now must be aided or run entirely on computers. And those analyses that previously were carried out computationally now must be made more efficient to have any hopes of being completed in a timely manner [7,19].
How then is the microbial community sequencing data converted from reads off a sequencing machine to bar graphs, network diagrams, and biological conclusions? Fortunately, even as data sets have expanded, most researchers analyzing rDNA sequence data sets, even when they are very large, have a similar set of goals in their analysis. For example, most studies are interested in assigning a microbial identity to the 16S rDNA sequences and determining the proportion of these organisms in each sequence collection. And to achieve these (and related) goals, a similar set of steps is used (Fig. 1), including aligning the rDNA sequences in a dataset to each other so that they are comparable, removing chimeric sequences generated during PCR, identifying closely related sets of sequences (also known as operational taxonomic units or OTUs), removing redundant sequences above a certain percent identity cutoff, assigning putative taxonomic identifiers to each sequence or representative of a group, inferring a phylogenetic tree of the sequences, and comparing the phylogenetic structure of different samples to each other and to the larger bacterial or archaeal tree of life.
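One of the steps listed above, grouping closely related sequences into OTUs at a percent identity cutoff, can be sketched as follows. This is a simplified greedy scheme written for illustration only; it is not the algorithm used by WATERS, mothur, or QIIME. It assumes the sequences are already aligned to the same length, and the function names and the 97% default cutoff are choices made for this example.

```python
def percent_identity(a, b):
    """Fraction of aligned positions (gaps ignored) where a and b agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches.

    seqs: dict of sequence name -> aligned sequence.
    Returns a list of OTUs; the first name in each OTU is its seed.
    """
    otus = []
    for name, seq in seqs.items():
        for otu in otus:
            if percent_identity(seq, seqs[otu[0]]) >= threshold:
                otu.append(name)
                break
        else:
            otus.append([name])   # no seed was close enough: start a new OTU
    return otus


reads = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",   # one mismatch against s1
    "s3": "TTTTTTTTTT",
}
otus = greedy_otus(reads, threshold=0.9)
```

With the toy reads above, s1 and s2 fall into one OTU and s3 forms its own. Greedy seeding depends on input order, which is one reason the hierarchical methods mentioned in the text are usually preferred.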
Over the last few years, a large number of software tools and web applications have become available to carry out each of the above steps (e.g., [20,21] for chimera checking, [22] for phylogenetic comparisons, STAP for taxonomy assignments). In practice, even as new software became available, researchers still have to act as the drivers of the workflow. At each step in this process, different types of software must be chosen and employed, each with distinct data formatting requirements, invocation methods, and each associated with a variety of post-analysis steps that may be selected and applied. Even after all of these steps have been completed, a wide variety of statistical and visualization tools are applied to these results to interpret and represent these data. In this context, there is a clear need for tools that will run a comprehensive set of analyses all linked together into one system. Very recently, two such systems have been released: mothur and QIIME. WATERS is our effort in this regard with some key differences compared to mothur and QIIME.
Figure 1 Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WATERS. Quality control files are generated for white boxes, but not otherwise routinely analyzed. Black arrows indicate that metadata (e.g., sample type) has been overlaid on the data for downstream interpretation. Colored boxes indicate different types of results files that are generated for the user for further use and biological interpretation. Colors indicate different types of WATERS actors from Fig. 2 which were used: green, Diversity metrics, WriteGraphCoordinates, Diversity graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac; yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile; white, remaining unnamed actors.
[Figure 1 boxes: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table]
Hartman et al 2010. W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics 2010, 11:317 doi:10.1186/1471-2105-11-317
Motivations
As outlined above, successfully processing microbial sequence collections is far from trivial. Each step is complex and usually requires significant bioinformatics expertise and time investment prior to the biological interpretation. In order to both increase efficiency and ensure that all best-practice tools are easily usable, we sought to create an "all-inclusive" method for performing all of these bioinformatics steps together in one package. To this end, we have built an automated, user-friendly, workflow-based system called WATERS: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences (Fig. 1). In addition to being automated and simple to use, because WATERS is executed in the Kepler scientific workflow system (Fig. 2) it also has the advantage that it keeps track of the data lineage and provenance of data products [23,24].
Automation
The primary motivation in building WATERS was to minimize the technical, bioinformatics challenges that arise when performing DNA sequence clustering, phylogenetic tree, and statistical analyses by automating the 16S rDNA analysis workflow. We also hoped to exploit additional features that workflow-based approaches entail, such as optimized execution and data lineage tracking and browsing [23,25-27]. In the earlier days of 16S rDNA analysis, simply knowing which microbes were present and whether they were biologically novel was a noteworthy achievement. It was reasonable and expected, therefore, to invest a large amount of time and effort to get to that list of microbes. But now that current efforts are significantly more advanced and often require comparison of dozens of factors and variables with datasets of thousands of sequences, it is not practically feasible to process these large collections "by hand", and hugely inefficient if instead automated methods can be successfully employed.
Broadening the user base
A second motivation and perspective is that by minimizing the technical difficulty of 16S rDNA analysis through the use of WATERS, we aim to make the analysis of these datasets more widely available and allow individuals with
Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-clicking on any actor or connector allows it to be manipulated and re-arranged.
Sunday, September 16, 12
One Major Issue with rRNA
• Copy number varies greatly between taxa
• Can lead to significant errors in estimates of relative abundance from numbers of reads
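A small worked example shows how large the error can be. The sketch below assumes the rRNA copy number of each taxon is already known, which is the hard part in practice; the taxon names, read counts, and copy numbers are invented for illustration, and the function is not the published correction method.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert rRNA read counts into estimated relative cell abundances.

    read_counts: dict taxon -> number of rRNA reads observed
    copy_numbers: dict taxon -> rRNA gene copies per genome (assumed known)
    """
    adjusted = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(adjusted.values())
    return {t: value / total for t, value in adjusted.items()}


# Hypothetical community: equal numbers of cells of two taxa,
# one carrying 7 rRNA copies per genome and the other carrying 1.
reads = {"taxon_X": 700, "taxon_Y": 100}
copies = {"taxon_X": 7, "taxon_Y": 1}

raw_share_x = reads["taxon_X"] / sum(reads.values())   # share by raw reads
corrected = correct_for_copy_number(reads, copies)     # share by cells
```

By raw reads taxon_X looks like 87.5% of the community, while after dividing by copy number the two taxa are estimated at 50% each.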
Sunday, September 16, 12
Kembel Correction
Kembel, Wu, Eisen, Green. In press. PLoS Computational Biology. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance
Sunday, September 16, 12
Method 3: All in the family
Sunday, September 16, 12
??
Phylogenetic Challenge
Sunday, September 16, 12
A single tree with everything?
Phylogenetic Challenge
Sunday, September 16, 12
rRNA analysis
[Figure: panels A, B, C; plotted symbols lost in extraction]
Cluster
[Figure: panels A, B, C; plotted symbols lost in extraction]
OTUs
[Figure: two trees, each with tips OTU1, OTU2, OTU3, OTU4 alongside E. coli, Humans, Yeast]
Just Phylogeny
Sunday, September 16, 12
alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).
We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.
To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
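The tree-path (PD) distance and the clustering step described above can be illustrated with a toy tree. This sketch is not PhylOTU or MOTHUR code: the tree is stored as a hypothetical child -> (parent, branch length) map, and a simple nearest-neighbor (single-linkage) grouping at a fixed threshold stands in for MOTHUR's hierarchical clustering. Branch lengths are assumed to be non-negative.

```python
from itertools import combinations


def path_to_root(node, parent):
    """Map each ancestor of node (and node itself) to its distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist


def pd_distance(a, b, parent):
    """Total branch length on the tree path between reads a and b."""
    da, db = path_to_root(a, parent), path_to_root(b, parent)
    # The closest shared ancestor gives the smallest combined distance.
    return min(da[n] + db[n] for n in da if n in db)


def cluster_reads(reads, parent, threshold):
    """Nearest-neighbor clustering: link reads whose PD is <= threshold."""
    group = {r: r for r in reads}

    def find(r):
        while group[r] != r:
            r = group[r]
        return r

    for a, b in combinations(reads, 2):
        if pd_distance(a, b, parent) <= threshold:
            group[find(a)] = find(b)
    clusters = {}
    for r in reads:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


# Toy tree: r1 and r2 are close relatives, r3 sits on a long branch.
parent = {
    "r1": ("n1", 0.01),
    "r2": ("n1", 0.02),
    "n1": ("root", 0.10),
    "r3": ("root", 0.20),
}
otus = cluster_reads(["r1", "r2", "r3"], parent, threshold=0.05)
```

Raising the threshold merges clusters and lowering it splits them, which mirrors the tuning of cluster granularity mentioned in the text.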
PhylOTU Clusters Recapitulate PID Clusters
We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.
To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
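The true conjunction and true disjunction rates defined in the excerpt above follow directly from their definitions. The sketch below is an illustration rather than the authors' implementation; each clustering is represented as a hypothetical dict from sequence name to cluster label, and both dicts are assumed to cover the same sequences.

```python
from itertools import combinations


def co_clustering_rates(first, second):
    """Return (true conjunction rate, true disjunction rate).

    first, second: dict of sequence name -> cluster label in each analysis.
    """
    same_first = same_both = 0
    diff_first = diff_both = 0
    for a, b in combinations(sorted(first), 2):
        if first[a] == first[b]:
            same_first += 1
            same_both += second[a] == second[b]    # still together
        else:
            diff_first += 1
            diff_both += second[a] != second[b]    # still apart
    conjunction = same_both / same_first if same_first else float("nan")
    disjunction = diff_both / diff_first if diff_first else float("nan")
    return conjunction, disjunction


pid_clusters = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
pd_clusters = {"s1": "X", "s2": "X", "s3": "X", "s4": "Y"}
rates = co_clustering_rates(pid_clusters, pd_clusters)
```

In this toy case one of the two co-clustered pairs stays together and two of the four separated pairs stay apart, so both rates are 0.5. Two identical clusterings give 1.0 for both.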
Finding Metagenomic OTUs
PLoS Computational Biology | www.ploscompbiol.org 3 January 2011 | Volume 7 | Issue 1 | e1001061
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
PhylOTU
Sunday, September 16, 12
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
RecA, RpoB in GOS
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Sunday, September 16, 12
Phylosift/ pplacer
Aaron Darling, Guillaume Jospin, Holly Bik, Erick Matsen, Eric Lowe, and others
Sunday, September 16, 12
Phylosift
• Probabilistic Phylogenetic Ecology
• https://github.com/gjospin/PhyloSift
• http://phylosift.wordpress.com
Sunday, September 16, 12
Method 4: All in the genome
Sunday, September 16, 12
Multiple Genes?
A single tree with everything?
Sunday, September 16, 12
Kembel Combiner
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Sunday, September 16, 12
Kembel Combiner
cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call "weighted UniFrac." We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.
MATERIALS AND METHODS
Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).
The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:
u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|
Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:
D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)
Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
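The two equations above translate almost line for line into code. The following is a minimal sketch, not the UniFrac software: the tree is supplied as a hypothetical list of (branch length, descendant sequences) pairs plus a dict of root distances, and both communities are assumed to contain at least one sequence.

```python
def weighted_unifrac(branches, root_distance, counts_a, counts_b):
    """Return (u, u / D) for communities A and B on one tree.

    branches: list of (b_i, set of sequences descending from branch i)
    root_distance: dict sequence j -> d_j, its distance from the root
    counts_a, counts_b: dict sequence -> times observed in A and in B
    """
    a_total = sum(counts_a.values())   # A_T
    b_total = sum(counts_b.values())   # B_T

    u = 0.0
    for length, descendants in branches:
        a_i = sum(counts_a.get(s, 0) for s in descendants)
        b_i = sum(counts_b.get(s, 0) for s in descendants)
        u += length * abs(a_i / a_total - b_i / b_total)

    d = sum(
        dist * (counts_a.get(s, 0) / a_total + counts_b.get(s, 0) / b_total)
        for s, dist in root_distance.items()
    )
    return u, u / d


# Toy tree: two tips, x and y, each on a branch of length 1 from the root.
branches = [(1.0, {"x"}), (1.0, {"y"})]
root_distance = {"x": 1.0, "y": 1.0}

# Community A contains only x, community B contains only y.
raw, normalized = weighted_unifrac(branches, root_distance, {"x": 3}, {"y": 5})
```

For these nonoverlapping communities the raw value u is 2.0 and the normalized value is 1.0, while two communities with the same relative abundances give 0.0 for both, matching the behavior described in the text.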
Clustering with normalized u values treats each sample equally instead of
TABLE 1. Measurements of diversity
Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Sunday, September 16, 12
Uses of Phylogeny in Genomics and Metagenomics
Example 2:
Functional Diversity and Functional Predictions
Sunday, September 16, 12
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flow chart of the METHOD, shown side by side for EXAMPLE A and EXAMPLE B. Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. A final panel shows the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Tree labels include genes 1 to 6 and 1A, 2A, 3A, 1B, 2B, 3B; Species 1, 2, 3; "Duplication?"; and "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
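The last two steps of the method (overlay known functions onto the tree, then infer the likely function of the gene of interest) can be sketched with a toy rule. This is a simplification made for illustration, not the procedure from Eisen 1998: the tree is a hypothetical nested tuple of gene names, and the query is given the function shared by all characterized genes in the smallest clade that contains it, or "ambiguous" if they disagree. The query is assumed to be present in the tree.

```python
def leaves(tree):
    """All gene names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def predict_function(tree, query, known):
    """Predict the function of query from its closest characterized relatives.

    known: dict gene name -> experimentally known function.
    Returns a function name, "ambiguous", or None if nothing is known.
    """
    best = None
    node = tree
    while not isinstance(node, str):
        functions = {known[g] for g in leaves(node) if g in known and g != query}
        if functions:
            best = functions          # a smaller clade overrides a larger one
        node = next(child for child in node if query in leaves(child))
    if best is None:
        return None
    return next(iter(best)) if len(best) == 1 else "ambiguous"


# Toy gene tree with two subfamilies, A and B, as in the figure labels.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "3B": "function B"}

prediction_2a = predict_function(gene_tree, "2A", known)
prediction_2b = predict_function(gene_tree, "2B", known)
```

Here 2A is predicted to have function A and 2B function B, because each sits inside a subfamily whose characterized members agree. A query that branches outside both subfamilies would be reported as ambiguous.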
Sunday, September 16, 12
Diversity of Proteorhodopsins
Venter et al., 2004. Science 304: 66.
Sunday, September 16, 12
Improving Functional Predictions
• Same methods discussed for phylotyping improve phylogenomic functional prediction for protein families
• Increase in sequence diversity helps too
PhyloSift / pplacer
Aaron Darling, Guillaume Jospin, Holly Bik, Erick Matsen, Eric Lowe, and others
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
NMF in Metagenomes
Characterizing the niche-space distributions of components
[Figure 3, panels (a), (b), (c). Rows: 45 sampling sites labeled by region, sample ID and habitat (regions: North American East Coast, Eastern Tropical Pacific, Polynesia Archipelagos, Galapagos Islands, Indian Ocean, Sargasso Sea, Caribbean Sea; e.g. North American East Coast_GS005_Embayment, Sargasso Sea_GS001c_Open Ocean). Component columns: Component 1 to Component 5. Environmental variable columns: Salinity, Sample Depth, Chlorophyll, Temperature, Insolation, Water Depth. Color scales: 0.1 to 0.6 and 0.2 to 1.0. Legends: General: High, Medium, Low, NA. Water depth: >4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m.]
Figure 3: a) Niche-space distributions for our five components (H^T); b) the site-similarity matrix (H^T H); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
Figure 3a shows the estimated niche-space distribution for each of the five components. Components 2 (Photosystem) and 4 (Unidentified) are broadly distributed; Components 1 (Signalling) and 5 (Unidentified) are largely restricted to a handful of sites; and component 3 shows an intermediate pattern. There is a great deal of overlap between niche-space distributions for different components.
Figure 3b shows the pattern of filtered similarity between sites. We see clear patterns of grouping, that do not emerge when we calculate functional distances without filtering, or using PCA rather than NMF filtering (Figure 3 in Text S1). As with the Pfams, we see clusters roughly associated with our components, but there is more overlapping than with the Pfam clusters (Figure 2b).
Figure 3c shows the distribution of environmental variables measured at each site. Inspection of Figure 3 reveals qualitative correspondence between environmental factors and clusters of similar sites in the similarity matrix. For example, the “North American East Coast” samples are divided into two groups, one in the top left and the other in the bottom right of the similarity matrix. Inspection of the environmental features suggests that the split in these samples could be mostly due to the differences in insolation and water depth.
We can also examine patterns of similarity between the components themselves, using niche-site distributions or functional profiles (see Figure 5 in Text S1). All 5
Functional biogeography of ocean microbes revealed through non-negative matrix factorization. Jiang et al. In press PLoS One. Comes out 9/18.
w/ Weitz, Dushoff, Langille, Neches, Levin, etc
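To make the H^T and H^T H notation in the Figure 3 caption concrete, here is a small pure-Python sketch of non-negative matrix factorization. It assumes the usual layout V ≈ W H with V as features x sites and H as components x sites, so that H^T H is site by site. It uses the standard Lee-Seung multiplicative updates, which may differ from the procedure in the paper, and the toy matrix, function names and parameters are all invented for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor non-negative v (features x sites) as w @ h with k components."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def site_similarity(h):
    """Site-by-site similarity matrix H^T H."""
    return matmul(transpose(h), h)


# Toy abundance matrix: 4 protein families (rows) x 4 sites (columns).
# Sites 0 and 1 share one functional profile, sites 2 and 3 share another.
V = [
    [4.0, 2.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
]

W, H = nmf(V, k=2)
SIMILARITY = site_similarity(H)
```

Each column of H (a row of H^T) describes how strongly a site expresses each component, so sites with the same dominant component end up with large entries in H^T H, which is the kind of block pattern the figure text describes.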
Uses of Phylogeny in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
GEBA
http://www.jgi.doe.gov/programs/GEBA/pilot.html
THAT IS SO LAMG10
How To Keep Up?
• IMG
• Genomes Online
• MicrobeDB
  • http://github.com/mlangill/microbedb/
  • Langille MG, Laird MR, Hsiao WW, Chiu TA, Eisen JA, Brinkman FS. MicrobeDB: a locally maintainable database of microbial genomic sequences. Bioinformatics. 2012 28(14):1947-8.
Improving Phylotyping
More Markers
Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroides              25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684
Better Reference Tree
Morgan et al. submitted
Improving Functional Predictions
Sifting Families
[Figure 1 (flowchart), Sharpton et al. submitted]
Representative Genomes -> Extract Protein Annotation -> All v. All BLAST -> Homology Clustering (MCL) -> SFams -> Align & Build HMMs -> HMMs
New Genomes -> Extract Protein Annotation -> Screen for Homologs (against the HMMs)
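The homology clustering step of the flowchart can be illustrated with a short Python sketch. It is a simplified stand-in, not MCL: it groups proteins into families by taking connected components over all-versus-all hits that pass a score cutoff. The hit list, protein names, function name and cutoff value are all hypothetical.

```python
def cluster_hits(hits, min_bitscore=50.0):
    """Group proteins into families from pairwise hits.

    `hits` is an iterable of (protein_a, protein_b, bitscore) tuples.
    Connected components are used here as a simple stand-in for MCL.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, score in hits:
        root_a, root_b = find(a), find(b)  # register both proteins
        if score >= min_bitscore and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in families.values())


# Invented all-versus-all hits; the weak p3/p4 hit does not merge the families.
HITS = [
    ("p1", "p2", 200.0),
    ("p2", "p3", 120.0),
    ("p4", "p5", 90.0),
    ("p3", "p4", 20.0),
]

FAMILIES = cluster_hits(HITS)
```

Connected components will merge families through any single strong hit, which is one reason a graph-flow method such as MCL is preferred in practice. The sketch only shows where clustering sits between the BLAST step and the per-family alignments and HMMs.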
Zorro - Automated Masking
[Figure: Distance to True Tree (y-axis, 0.0 to 9.0) versus Sequence Length (x-axis: 200, 400, 800, 1600, 3200) for three treatments: no masking, zorro, gblocks]
Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288
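The idea of masking an alignment can be sketched as a column filter. The example below assumes that some tool has already assigned a confidence score to every alignment column. The scores, the cutoff and the function name are invented for illustration and do not reproduce how ZORRO computes or applies its scores.

```python
def mask_alignment(alignment, column_scores, min_score=4.0):
    """Drop alignment columns whose confidence score is below min_score.

    `alignment` maps sequence names to aligned strings of equal length;
    `column_scores` holds one (hypothetical) confidence value per column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("every sequence must have one residue per scored column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment with two poorly supported columns (positions 3 and 4).
ALIGNMENT = {
    "taxon1": "MKT-LLA",
    "taxon2": "MKSALLA",
    "taxon3": "MRT-ILA",
}
COLUMN_SCORES = [9.5, 8.0, 6.2, 0.4, 3.1, 9.0, 9.8]

MASKED = mask_alignment(ALIGNMENT, COLUMN_SCORES)
```

Here the two low-scoring columns are removed and the remaining five columns are passed on to tree building. A hard cutoff is only one option, and the threshold should be treated as a tunable assumption.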
Phylogenetic Contrasts
GEBA Lesson
We have still only scratched the surface of microbial diversity
PD: All
From Wu et al. 2009 Nature 462, 1056-1060
Families/PD not uniform
97
Number of SAGs from Candidate Phyla
Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
GEBA uncultured
GEBA Lesson
Need Experiments from Across the Tree of Life too
Conclusion
MICROBES
Acknowledgements
• $$$
  • DOE
  • NSF
  • GBMF
  • Sloan
  • DARPA
  • DSMZ
  • DHS
• People, places
  • DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
  • UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
  • Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk