EBI is an Outstation of the European Molecular Biology Laboratory.
Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists
Metagenomic nucleotide sequence and annotation: Range of environments
Global ocean survey
Human faecal virus communities
Human distal gut microbiome
Phosphorus removal sludge communities
Obesity-associated gut microbiome
Acidophilic bacterial community
Mouse gut flora
Metagenomic nucleotide sequence and annotation: Data growth: projects
[Figure: number of metagenomic projects (y-axis, 0 to 30) by date (x-axis, Jun-03 to Jun-07)]
Metagenomic nucleotide sequence and annotation: Data growth: volume of dataset
[Figure: dataset volume by date (x-axis, Jun-03 to Jun-07), log scale; series: Bases (megabase pairs, 1 to 10,000) and Sequences (1 to 10,000,000)]
Metagenomic nucleotide sequence and annotation: Assembly issues
Most metagenome records have not been assembled into scaffolds in INSDC records (only 4 of 24 projects so far) and remain as unassembled WGS records
Those that have been assembled into scaffolds show very limited assembly - of the four assembled projects, one contains almost as many scaffolds as contigs
Metagenomic nucleotide sequence and annotation: Metadata issues
• Metadata, particularly sampling information, are often not shown, or are provided with limited granularity, restricting re-analysis by users
• INSDC offers appropriate structures for such metadata, but they are frequently not used, even when the information is available to the submitters
Current:
FT   source          1..2866
FT                   /organism="marine metagenome"
FT                   /environmental_sample
FT                   /mol_type="genomic DNA"
FT                   /isolation_source="isolated as part of a large dataset
FT                   composed predominantly from surface water marine samples
FT                   collected along a voyage from Eastern North American coast
FT                   to the Eastern Pacific Ocean, including locations in the
FT                   Sargasso Sea, Panama Canal, and the Galapagos Islands"
FT                   /note="metagenomic"
FT                   /db_xref="taxon:408172"
Could be:
FT   source          1..2866
FT                   /organism="marine metagenome"
FT                   /environmental_sample
FT                   /mol_type="genomic DNA"
FT                   /country="French Polynesia: Moorea, Cooks Bay"
FT                   /lat_lon="17.476 S 149.81 W"
FT                   /isolation_source="marine surface water; sample
FT                   depth: 34M; size range: 0.1-0.8 microns; water
FT                   temperature: 28.900; salinity: 35.100"
FT                   /db_xref="taxon:408172"
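A structured qualifier such as /lat_lon can be read by software directly, which free text in /isolation_source does not allow. The sketch below converts the /lat_lon value from the "Could be" record into signed decimal degrees. The function name `parse_lat_lon` and the assumed value layout ("degrees N|S degrees E|W") are illustrative assumptions, not part of any INSDC tool.

```python
import re

# Assumed layout of a /lat_lon value: "<degrees> <N|S> <degrees> <E|W>".
LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")


def parse_lat_lon(value):
    """Convert a /lat_lon qualifier value to (latitude, longitude) in signed degrees."""
    match = LAT_LON.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /lat_lon value: {value!r}")
    lat, ns, lon, ew = match.groups()
    # South and West are represented as negative values.
    latitude = float(lat) * (-1 if ns == "S" else 1)
    longitude = float(lon) * (-1 if ew == "W" else 1)
    return latitude, longitude


print(parse_lat_lon("17.476 S 149.81 W"))  # (-17.476, -149.81)
```

With coordinates in this form, records can be filtered or mapped by sampling location without any manual reading of the isolation source text.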
Metagenomic nucleotide sequence and annotation: Taxonomy issues
Taxonomic annotation in metagenomic data is simplistic - a very small number of non-specific taxa are necessarily used to describe all of the raw data
Analysis methodology, particularly binning, is inconsistent across the dataset, so taxonomic assertions in assembled sequence are of uncertain provenance
Standards on whether or not single contigs should contribute to scaffolds for more than one taxon are yet to be established
Metagenomes and UniProt (1/2)
• As of this month, ~6 million protein sequences from the Global Ocean Survey (GOS) have been released (vs. 4,534,260 UniProtKB entries)
• Future exponential increase is anticipated:
  • The growth of public protein sequence data is exponential with a doubling time of about 20 months
  • Metagenomics data will have substantially shorter doubling time
  • GOS data will more than double the existing protein-coding sequences in UniProtKB
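A doubling time T means that a dataset of size N0 grows to N0 · 2^(t/T) after t months. The sketch below works this through for the 20-month figure quoted above. The function names are illustrative, and the projection assumes the doubling time stays constant, which is an assumption rather than something the data guarantee.

```python
def growth_factor(months, doubling_time_months=20.0):
    """Factor by which an exponentially growing dataset expands in `months`."""
    return 2.0 ** (months / doubling_time_months)


def projected_size(current_size, months, doubling_time_months=20.0):
    """Projected dataset size, assuming a constant doubling time."""
    return current_size * growth_factor(months, doubling_time_months)


# One doubling time gives a factor of two.
print(growth_factor(20))  # 2.0

# Two doubling times (40 months) applied to the 4,534,260 UniProtKB entries.
print(round(projected_size(4_534_260, 40)))  # 18137040
```

A shorter doubling time, as expected for metagenomics data, can be explored by passing a smaller `doubling_time_months`.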
Metagenomes and UniProt (2/2)
• Perspectives
  • Vast amount of sequence data
  • Environmental context in metadata
  • New kind of data requires new storage, processing, and data mining procedures
• Taxonomically unassigned data will not be included in the UniProt Knowledgebase
• UniMES – UniProt Metagenomics and Environmental sequences (June 2007)
UniMes requirements
• Distinct storage and dissemination: separated from current UniProt databases.
• Distinct production pipeline
• Distinct accession number range: MES followed by 11 hexadecimal digits, e.g. MES00000000001
• Distinct data mining pipelines: less restricted rules due to the lack of basic knowledge about the taxonomic origin of these sequences
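The accession format described above (MES followed by 11 hexadecimal characters) can be checked and generated mechanically. This is a minimal sketch, not the production pipeline: the function names are hypothetical, and it assumes hexadecimal digits are written in upper case, which the slide does not state.

```python
import re

# Assumption: hexadecimal digits use upper case (0-9, A-F).
UNIMES_ACCESSION = re.compile(r"MES[0-9A-F]{11}")


def is_unimes_accession(text):
    """Return True if `text` has the form MES + 11 hexadecimal digits."""
    return UNIMES_ACCESSION.fullmatch(text) is not None


def format_unimes_accession(number):
    """Render an integer as an accession, zero-padded to 11 hex digits."""
    if not 0 <= number < 16 ** 11:
        raise ValueError("number does not fit in 11 hexadecimal digits")
    return f"MES{number:011X}"


print(is_unimes_accession("MES00000000001"))  # True
print(format_unimes_accession(1))             # MES00000000001
print(format_unimes_accession(255))           # MES000000000FF
```

Eleven hexadecimal digits allow 16^11 distinct accessions, so under these assumptions the range is far larger than the sequence counts discussed here.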
UniMes pipeline overview
[Diagram labels: EMBL primary data (genomic sequence (EMBL), metagenomics data (WGS), other submissions); UniProt Knowledgebase, UniProt Metagenomics, UniProt Archive; secondary analysis (classification, clustering, automatic annotation rules); DNA Metagenomics (to be established)]
UniProtKB vs. UniMes database growth
[Figure: UniProt Knowledgebase and UniProt Metagenomics and Environmental Sequences, in millions of sequences (y-axis, 0 to 6), 2005 to 2007]
UniMes storage growth
[Figure: UniProt Metagenomics and Environmental Sequences, storage in GB (y-axis, 0 to 300), 2005 to 2007]
UniMes hardware requirements (1/2)
• 2 HP/Compaq AlphaServer ES45 machines with 4 1250 MHz CPUs and 12 GB memory
  • Oracle database designed to store and maintain data derived from EMBL
  • Oracle Warehouse for data analysis, integration and display
• 64-bit Linux farm (AMD Opteron) using 40 nodes for data mining procedures
UniMes hardware requirements (2/2)
• New Oracle servers: Sun Fire V490 with 4 1500 MHz UltraSPARC IV CPUs and 16 GB memory
• We have enough physical storage and CPU power for 2007
GOS submission
• Submission of nucleic acid sequence data to EMBL/GenBank/DDBJ is mandatory for publication of a scientific paper
• Craig Venter Institute submission to EMBL/GenBank/DDBJ in March 2007
• Environmental metadata can only be found on the CAMERA website
• Metadata are of great importance for metagenomic sequence data:
  • Descriptions of sampling sites and habitats
  • Analysis of metagenomics sequence data
• URGENT need for the community to agree on what metadata must be included with the submission of any metagenomics sample
UniMes and GOS data
[Figure: Protein Sequences, GOS vs. UniParc (y-axis, 0 to 10,000,000)]
[Figure: Amino Acids (bln), GOS vs. UniParc (y-axis, 0 to 300)]
[Figure: Average Sequence Length, GOS vs. UniProt KB (y-axis, 0 to 400)]
[Figure: Median Sequence Length, GOS vs. UniProt KB (y-axis, 0 to 300)]
UniMes and GOS data
[Figure: Fragments (%), GOS vs. UniProt KB (y-axis, 0 to 45)]
[Figure: Amino Acid Distribution, GOS vs. UniProt KB, by residue code (A R N D C Q E G H I L K M F P S T W Y V B Z X); y-axis, 0 to 140,000,000]
UniMes and GOS data
[Figure: Percentage Coverage by InterPro Methods for UniProt KB, UniParc and GOS (y-axis, 79.5 to 84)]
UniMes and GOS data
Top 10 InterPro entries hitting UniProt:
Top 10 InterPro entries hitting GOS:
Top 10 InterPro entries hitting UniParc (including GOS):