Post on 07-Aug-2019
Simão, Waterhouse, et al. 2015, Supplementary Online Material: Page 1 of 13
Supplementary Online Material
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
Felipe A. Simão†, Robert M. Waterhouse†*, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov*
Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, rue Michel-Servet 1, 1211 Geneva, Switzerland.
† Equal contribution.
* To whom correspondence should be addressed: Robert.Waterhouse@unige.ch, Evgeny.Zdobnov@unige.ch
Contents:
1. BUSCO: Benchmarking Universal Single-Copy Orthologs
1.1. BUSCO selection
1.2. Hidden Markov models, ancestral sequences and block profiles
1.3. Candidate BUSCO matches from genome assemblies
1.4. Gene prediction: assessing genome assemblies and transcriptomes
1.5. BUSCO match assignment
1.6. Classification: Complete, Duplicated, Fragmented, Missing
1.7. Training Augustus gene finding parameters
2. BUSCO completeness versus N50 contiguity
3. BUSCO versus CEGMA assessment of genome assembly completeness
4. BUSCO assessments of genomes, transcriptomes, and gene sets
5. BUSCO and CEGMA analysis run-times
6. References
QUEST FOR QUALITY
“BUSCO CALIDAD”
“BUSCO QUALIDADE”
http://busco.ezlab.org
1. BUSCO: Benchmarking Universal Single-Copy Orthologs
1.1. BUSCO selection
Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups
with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level
orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al.,
2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species
phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others
they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one
sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB
species, excluding only those with unusually high numbers of missing or duplicated orthologs, while
retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can
therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the
appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3’078 BUSCO groups),
41 vertebrates (4’425 BUSCO groups), 93 metazoans (1’008 BUSCO groups), 125 fungi (1’438 BUSCO
groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial
BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce
the final BUSCO sets for each clade, representing 2’675 genes for arthropods, 3’023 for vertebrates, 843 for
metazoans, 1’438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected
from Mende, et al. (2013).
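As an illustration of the single-copy criterion described above, the following minimal Python sketch filters orthologous groups by the fraction of species in which they are single-copy. The data layout (a dictionary of group members) and the function name are hypothetical, and the additional sub-clade requirement and the later uniqueness and conservation filters are not modelled.

```python
from collections import Counter


def select_single_copy_groups(groups, species, min_fraction=0.9):
    """Return IDs of groups that are single-copy in >= min_fraction of species.

    groups:  dict mapping a group ID to a list of (species, gene_id) members
             (hypothetical layout, not the OrthoDB format).
    species: list of all species defining the clade.
    """
    selected = []
    for group_id, members in groups.items():
        # Count how many genes each species contributes to this group.
        copies = Counter(sp for sp, _gene in members)
        single_copy = sum(1 for sp in species if copies.get(sp, 0) == 1)
        if single_copy / len(species) >= min_fraction:
            selected.append(group_id)
    return selected


# Toy example with ten species: group "A" is single-copy in nine of them,
# group "B" is duplicated in two species and therefore falls below 90%.
species = [f"sp{i}" for i in range(10)]
groups = {
    "A": [(sp, f"A_{sp}") for sp in species[:9]],
    "B": [(sp, f"B_{sp}") for sp in species]
         + [("sp0", "B_sp0_dup"), ("sp1", "B_sp1_dup")],
}
print(select_single_copy_groups(groups, species))
```

Species in which a gene is lost or duplicated simply do not count towards the single-copy fraction, which mirrors the tolerance for losses and duplications in the remaining species.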
1.2. Hidden Markov models, ancestral sequences and block profiles
Hidden Markov models: For each BUSCO group, multiple sequence alignments (MSAs) were built with
ClustalOmega (Sievers and Higgins, 2014) using the orthologous protein sequences of each BUSCO. The
MSAs were then used to build amino acid-level hidden Markov model (HMM) profiles using HMMER 3
(Eddy, 2011). Subsequently, all BUSCO input sequences were searched (hmmsearch) against the complete
library of HMM profiles to identify and remove any BUSCO groups whose members could not be reliably
distinguished from each other by their profiles, and hence ensure reliable profile-delineated orthology. In
total, 376, 852, and 156 groups were removed in this way from the arthropod, vertebrate, and metazoan sets,
respectively, while none were removed for the fungi or eukaryote datasets. The remaining, reliably-
distinguishable BUSCO sets were then analysed to delineate the two parameters ‘expected-score’ and
‘expected-length’ that define the BUSCO-specific cut-offs used to classify a match as orthologous or not and
as complete or not. The ‘expected-score’ cut-off is defined as 90% of the minimum bitscore from an HMM
search of all of a BUSCO group’s members against its own HMM profile (i.e. 90% of the lowest-scoring match
among the sequences used to build the profile). To be classified as a true ortholog, any BUSCO-matching gene
from the species being assessed (from its genome, transcriptome, or gene set) must score at or above the
‘expected-score’ cut-off. For a match to be classified as ‘complete’, it must satisfy the ‘expected-length’ cut-off, which
is defined using each BUSCO group’s protein length distribution (Figure S1). Any BUSCO-matching gene
from the species being assessed whose protein length falls within two standard deviations (2σ) of the
BUSCO group’s mean length is classified as ‘complete’.
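The two cut-offs can be sketched in a few lines of Python. This is a minimal illustration rather than the BUSCO implementation: the function names are hypothetical, the bitscores and lengths are toy values, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev


def expected_score_cutoff(member_bitscores, fraction=0.9):
    """90% of the lowest bitscore of the group's own members against its profile."""
    return fraction * min(member_bitscores)


def expected_length_window(member_lengths, n_sigma=2.0):
    """Length interval within n_sigma standard deviations of the group mean.

    Assumption: population standard deviation (pstdev); the source text does
    not specify the estimator.
    """
    mu = mean(member_lengths)
    sigma = pstdev(member_lengths)
    return mu - n_sigma * sigma, mu + n_sigma * sigma


# Toy values for one BUSCO group.
bitscores = [512.0, 480.5, 450.0]
lengths = [300, 310, 290, 305, 295]

print(expected_score_cutoff(bitscores))
print(expected_length_window(lengths))
```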
Consensus sequences: For each BUSCO group, an amino acid consensus sequence was generated from its
respective HMM profile using HMMER’s default hmmemit settings for a majority-rule consensus sequence.
These consensus sequences are used during BUSCO assessments of genome assemblies to search the
genome of the species being assessed to identify the best-matching genomic regions that may encode the
corresponding BUSCO-matching gene.
Figure S1. Distribution of the percent differences between BUSCO group member proteins and the
group’s mean protein length (negative = shorter than the mean, positive = longer than the mean, values
of one and two standard deviations are shown with lines). Insets: spread of BUSCO group member
protein lengths compared to BUSCO group mean lengths for arthropods (left) and vertebrates (right).
Block profiles: For each BUSCO group, a ‘block profile’ was built to guide automated gene predictions
with Augustus (Keller, et al., 2011). Block profiles are position-specific frequency matrices that model
conserved regions of multiple sequence alignments. The BUSCO group block profiles were created from
their corresponding protein multiple sequence alignments using the msa2prfl script from the Augustus
package. Several highly-divergent BUSCO groups failed to produce reliable block profiles, even after
processing their alignments with the Augustus preparealign script, and were therefore removed from the
assessment sets: 27, 149, 51, 0 and 2 BUSCO groups were removed from the arthropod, vertebrate,
metazoan, fungi and eukaryote sets, respectively.
1.3. Candidate BUSCO matches from genome assemblies
Regions in a genome likely to encode BUSCO-matching genes are identified by tBLASTn searches
(Camacho, et al., 2009) with the reconstructed consensus sequences of each BUSCO. Neighbouring
high-scoring segment pairs (HSPs) from the tBLASTn searches are merged if located within 50 Kbp of each other,
thus defining the span of the genomic regions to be evaluated. These genomic regions are then ranked
according to the total length of the consensus sequence aligned, and up to three regions are selected for the
subsequent gene prediction steps. The second- and third-ranked regions must have consensus sequence
alignment lengths of at least 70% of the aligned length of the top-ranking region. Selecting more than just the
best candidate BUSCO match allows for the identification of duplicated BUSCOs in the assessed genome;
because BUSCOs are normally rarely duplicated, numerous duplications could indicate erroneously assembled
haplotypes. Lastly, the selected genomic regions are extended with flanking regions of 5 Kbp (small genomes)
or 20 Kbp (large genomes); these are the default parameters, and users can specify their own flank-extension lengths.
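The region-selection logic of this step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the BUSCO code: HSPs are represented as hypothetical (start, end, aligned_length) tuples on a single sequence and strand, aligned lengths of merged HSPs are simply summed (overlaps on the consensus sequence are not corrected for), and flanks are clipped only at the start of the sequence.

```python
def merge_hsps(hsps, max_gap=50_000):
    """Merge neighbouring HSPs that lie within max_gap bp of each other.

    hsps: list of (start, end, aligned_length) tuples on one sequence.
    Returns a list of merged (start, end, total_aligned_length) regions.
    """
    regions = []
    for start, end, aligned in sorted(hsps):
        if regions and start - regions[-1][1] <= max_gap:
            prev_start, prev_end, prev_aligned = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), prev_aligned + aligned)
        else:
            regions.append((start, end, aligned))
    return regions


def select_regions(regions, max_regions=3, min_fraction=0.7, flank=5_000):
    """Rank regions by aligned length, keep up to max_regions, add flanks.

    Lower-ranked regions are kept only if their aligned length is at least
    min_fraction of that of the top-ranking region.
    """
    ranked = sorted(regions, key=lambda r: r[2], reverse=True)
    if not ranked:
        return []
    best = ranked[0][2]
    kept = [r for r in ranked[:max_regions] if r[2] >= min_fraction * best]
    return [(max(0, s - flank), e + flank, a) for s, e, a in kept]


# Toy HSPs: the first two are 28 Kbp apart and are merged into one region.
hsps = [(1_000, 2_000, 300), (30_000, 31_000, 250),
        (500_000, 501_000, 400), (900_000, 900_500, 100)]
candidates = select_regions(merge_hsps(hsps))
print(candidates)
```

In this toy case the third region is discarded because its aligned length (100) is below 70% of the best region's aligned length (550).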
1.4. Gene prediction: assessing genome assemblies and transcriptomes
The candidate BUSCO-matching regions identified in the previous step are extracted from the genome
being assessed for processing by the Augustus automated gene prediction procedure. Gene prediction is
performed on each candidate region using the corresponding BUSCO group’s block profile, and default gene
finding parameters (unless otherwise specified by the user). Successful Augustus gene prediction for each
BUSCO group produces an initial BUSCO gene set whose protein sequences are then evaluated using the
BUSCO-specific cut-offs to determine true orthology and completeness. High-confidence predicted BUSCO
genes can then be selected from this initial gene set for the training of Augustus to rerun the automated gene
prediction procedure with these specific genome-trained parameters (see below). For assessing
transcriptomes, if the transcripts have not already been pre-processed to extract protein-coding genes then the
longest open reading frame (ORF) is selected for assessment.
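Selecting the longest ORF from a transcript can be sketched as below. The text does not specify how ORFs are delimited, so this minimal Python example assumes ATG start codons, the three standard stop codons, and a search over all six reading frames; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf(transcript):
    """Return the longest ATG-to-stop ORF (stop codon included), or "" if none.

    Assumptions: ATG start, standard stop codons, both strands searched.
    """
    seq = transcript.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    best = ""
    for strand in (seq, reverse_complement):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None  # look for the next ORF in this frame
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))
```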
1.5. BUSCO match assignment
This step uses the properties of each BUSCO group’s HMM profile to determine whether a significantly
matching protein sequence is likely orthologous or just homologous. Significant matches are first determined
by searching the full set of protein sequences to be assessed against the complete library of BUSCO group
HMM profiles using HMMER’s hmmsearch. As described above, filtering of the initial BUSCO sets ensured
that each library contains only reliably-distinguishable profiles. The set of protein sequences to be assessed
may be from the Augustus-predicted BUSCO gene set, a transcriptome-based gene set, or the annotated
‘Official Gene Set’ (OGS). For each hmmsearch sequence-profile alignment, two measures are computed
and evaluated: the alignment bitscore and the total length of sequence aligned to the HMM profile. For a
BUSCO-matching gene to be considered orthologous, the alignment bitscore must be greater than or equal to
the ‘expected-score’ of the corresponding BUSCO group (see above for ‘expected-score’ definition). Genes
that pass the ‘expected-score’ cut-off are then evaluated for protein length completeness as described below.
1.6. Classification: Complete, Duplicated, Fragmented, Missing
The final stage of the assessment classifies each arthropod, vertebrate, metazoan, fungal, or eukaryote
BUSCO as complete, duplicated, fragmented, or missing from the gene set being assessed. Classification of
BUSCO-matching genes that meet the ‘expected-score’ cut-off employs the protein length distribution of
each BUSCO to determine whether the ortholog is ‘Complete’ or ‘Fragmented’. Orthologs are considered to
be ‘Complete’ if the length of their aligned sequence is within two standard deviations (2σ) of the BUSCO
group’s mean length (i.e. 95% expectation), otherwise they are classified as ‘Fragmented’ recoveries (Figure
S1). A BUSCO is classified as ‘Duplicated’ when multiple BUSCO-matching genes meet both the
‘expected-score’ and the ‘expected-length’ cut-offs, i.e. multiple copies of full-length orthologs are found in
the gene set being assessed. Lastly, any BUSCO without a BUSCO-matching gene that meets the ‘expected-
score’ cut-off is classified as ‘Missing’.
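The classification rules above can be summarised in a short decision function. This is a minimal Python sketch of the described logic, assuming each candidate match is given as a hypothetical (bitscore, aligned_length) pair and that the group's cut-offs have already been computed.

```python
def classify_busco(matches, expected_score, mean_length, sd_length, n_sigma=2.0):
    """Classify one BUSCO group from its candidate matches.

    matches: list of (bitscore, aligned_length) pairs for this BUSCO group.
    """
    # Orthology: the bitscore must reach the 'expected-score' cut-off.
    orthologs = [m for m in matches if m[0] >= expected_score]
    if not orthologs:
        return "Missing"
    # Completeness: aligned length within n_sigma SDs of the group mean.
    complete = [m for m in orthologs
                if abs(m[1] - mean_length) <= n_sigma * sd_length]
    if len(complete) > 1:
        return "Duplicated"
    if len(complete) == 1:
        return "Complete"
    return "Fragmented"


# Toy cut-offs: expected-score 405, mean length 300, standard deviation 7.
print(classify_busco([(500.0, 298)], 405.0, 300.0, 7.0))
print(classify_busco([(500.0, 150)], 405.0, 300.0, 7.0))
```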
1.7. Training Augustus gene finding parameters
Augustus predictions using generic parameters may produce sub-optimal gene predictions. Training the
prediction parameters using the most reliable gene structures obtained from the initial set of predictions can
substantially improve the results. To train Augustus, BUSCO-matching genes classified as ‘Complete’ and
single-copy are selected to form a high-quality training dataset. The selected gene structures are extracted,
and used to build GenBank files (gff2smallgb) suitable for training Augustus (etraining). This procedure
results in the creation of genome-specific gene finding parameters; for the vast majority of genomes
evaluated, when compared to ‘generic’ gene finding parameters, these genome-specific parameters result in
substantial increases in the sensitivity and specificity of Augustus predictions, both at gene and exon levels.
A second round of Augustus gene prediction is then performed using these genome-specific parameters on
all BUSCO-matching candidate regions where initial predictions failed or did not yield a ‘Complete’
ortholog. Orthology assessment, protein length evaluations, and final classifications are then performed as
outlined above to produce the final BUSCO assessment results.
Augustus allows for the possibility of further sensitivity and specificity gains by applying multiple rounds
of metaparameter optimisation performed using OptimizeAugustus. However, this extra optimisation step
comes at the cost of generally more than double the run-time for a typical genome assembly assessment,
without large improvements in assessment sensitivity. Thus, for default genome assembly assessments, this
extra optimisation step is not performed unless specified by the user (--long mode). This option is made
available to users because although the improvements from this extra optimisation step are minimal for the
purposes of assembly assessments, they can prove valuable when using BUSCO sets to train gene predictors
for subsequent use as part of multi-evidence-based whole genome annotation pipelines.
2. BUSCO completeness versus N50 contiguity
BUSCO assessment of genome assembly completeness is designed to provide a more detailed
quantification of assembly quality than traditional measures such as scaffold N50 metrics of assembly
contiguity. Comparing BUSCO completeness with N50 contiguity for a selection of genomes ranging from
fragmented draft assemblies to chromosome-level genome assemblies reveals the low correlation (r=0.149)
between these measures (Figure S2). Thus, even fragmented assemblies with relatively low N50 values can
encode fairly complete gene sets, and some assemblies that appear to be of good quality based on contiguity
measures are not necessarily more complete in terms of expected gene content. Additionally, when assessing
gene sets, it is clear that species with very high gene counts are not necessarily the most complete, nor are
those with rather low gene counts necessarily incomplete (Waterhouse, 2015). For a typical eukaryotic draft
assembly, BUSCO assessments suggest that assemblies with N50 values on the order of 50 Kbp are capable
of yielding fairly complete gene sets.
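For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal Python sketch of this standard definition (the function name is illustrative, and lengths may be given in any consistent unit):

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the cumulative length
        # to at least half of the total assembly length.
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


print(n50([80, 70, 50, 40, 30, 20]))
```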
Figure S2. BUSCO completeness versus N50 contiguity. Nine outliers with N50 values above 10’000 Kbp
are not shown; each achieves more than 90% BUSCO completeness.
3. BUSCO versus CEGMA assessment of genome assembly completeness
The Core Eukaryotic Genes Mapping Approach (CEGMA) is a widely-used method to assess genome
assembly completeness in terms of gene content (Parra, et al., 2007; Parra, et al., 2009), but does not provide
a means for directly assessing gene sets. CEGMA employs a set of 248 conserved Core Eukaryotic Genes
(CEGs) expected to be present in any newly sequenced eukaryotic genome. The CEGs are derived from
eukaryotic KOGs (Tatusov, et al., 2003) and are composed of orthologous protein sequences from six
eukaryotic species (human, fruit fly, roundworm, thale cress, fission yeast and baker’s yeast), for which a
corresponding HMM profile is built from their multiple sequence alignments.
In order to perform a like-for-like comparison of the CEGMA and BUSCO genome assembly and gene
set assessments, a subset of 250 of the 429 eukaryote BUSCOs was selected with the lowest variations of
their ‘expected-score’ and ‘expected-length’ parameters. As the CEGMA pipeline does not perform gene set
assessments, an analysis pipeline was built to use the CEGMA HMM profiles instead of the BUSCO HMM
profiles. In addition, the pipeline employed the cut-offs that CEGMA uses to determine the presence/absence
(from the provided ‘cutoff_file’ with the cut-offs for CEGMA HMMs) and complete/partial (complete,
>70% CEG length) status of potentially orthologous matches.
Thus, BUSCO assessments of genome assemblies and gene sets were performed with normal default
options except for substituting the full eukaryote BUSCO set with a subset of only 250 in order to approximately match the
number of CEGMA CEGs (248). The CEGMA assessments of genome assemblies were performed with normal
default options, and CEGMA assessments of gene sets were enabled by building a pipeline to use CEGMA
HMM profiles and cut-offs. The results for the assessments of 40 species are shown in Figure 2 of the main
text. They reveal generally consistent BUSCO assessments across highly divergent lineages from fungi to
human, with somewhat less consistent results from the CEGMA assessments (BUSCO linear regression
more closely follows the diagonal than that of CEGMA).
Linear regressions of each set, adjusted $R^2$:
BUSCO: $R^2 = 0.718$
CEGMA: $R^2 = 0.413$
$R^2 = \mathrm{SSR} / \mathrm{SST}$, where $\mathrm{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$ and $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$
$y_i$ is the ith observed value
$\hat{y}_i$ is the ith expected value from the best-fit line
and $\bar{y}$ is the mean of $y$
To evaluate against the diagonal (x = y) instead of the best-fit, the expected value ($\hat{y}_i$) simply becomes the x
value ($x_i$), and there is no intercept term (i.e. x = y = 0) so:
$R^2_{(x=y)} = 1 - (\mathrm{SSE} / \mathrm{SST})$, where $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$
BUSCO: $R^2_{(x=y)} = 1 - (1281.6 / 3440.5) = 0.63$
CEGMA: $R^2_{(x=y)} = 1 - (5944.3 / 1936.3) = -2.07$
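The comparison against the diagonal can be reproduced with a few lines of Python. The function below is a minimal sketch of the definition given above (the name is illustrative); the final lines recompute the two reported values directly from the stated SSE and SST sums, since the underlying per-species data points are not listed here.

```python
def r_squared_vs_diagonal(x, y):
    """R^2 of observations y against the line y = x (expected value = x_i)."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


# Recomputing the reported values from the stated sums of squares.
busco_r2 = 1 - 1281.6 / 3440.5
cegma_r2 = 1 - 5944.3 / 1936.3
print(round(busco_r2, 2), round(cegma_r2, 2))
```

A negative value, as obtained for CEGMA, means that the diagonal fits the data worse than a horizontal line at the mean of y would.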
4. BUSCO assessments of genomes, transcriptomes, and gene sets
The BUSCO assessment pipeline was applied to 70 available genome assemblies and their corresponding
official gene sets, as well as to 23 additional gene sets (93 gene sets in total), and 96 transcriptomes. The detailed results are shown
in Table S1 in C[D],F,M,n BUSCO notation. The evaluated genome assemblies include both high quality
reference genomes (e.g. Homo sapiens), as well as de novo assemblies of non-model organisms, sampling a
wide range of different fold-coverage levels, N50 sizes, sequencing technologies, and assembly strategies.
These genomes represent the four major BUSCO lineages with 41 arthropods from 13 different orders, 3
vertebrates from 3 different orders, 11 basal metazoans, and 15 fungal species from 12 different orders. The
gene sets chosen for these assessments comprise: 41 arthropods, 26 vertebrates, 11 basal metazoans and 15
fungal species. 96 transcriptomes were also evaluated; sequences were typically derived from mRNA
extracted from different tissue types. The transcriptomes analysed cover a total of 11 fungal species (14
transcriptomes), 39 arthropods (44 transcriptomes), 18 vertebrates (28 transcriptomes) and 10 basal
metazoans (13 transcriptomes). Duplications [D] were not assessed (n.a.) for unfiltered gene sets or
transcriptomes that contained multiple transcripts of the same gene as this would lead to overestimates of
BUSCO duplications.
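The C[D],F,M,n notation is straightforward to parse programmatically. The sketch below is a hypothetical Python helper, not part of BUSCO; it assumes the exact layout used in Table S1, returns `None` for duplications reported as n.a., and includes a simple sanity check that C, F, and M add up to roughly 100% (the 2.5-point tolerance for rounding is an assumption).

```python
import re

_NOTATION = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[D:(?P<D>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco_notation(text):
    """Parse a 'C:..% [D:..%], F:..%, M:..%, n:..' summary into a dict."""
    match = _NOTATION.search(text)
    if match is None:
        raise ValueError(f"not in BUSCO notation: {text!r}")
    dup = match.group("D")
    return {
        "C": float(match.group("C")),
        "D": None if dup == "n.a." else float(dup.rstrip("%")),
        "F": float(match.group("F")),
        "M": float(match.group("M")),
        "n": int(match.group("n")),
    }


def sums_to_100(parsed, tolerance=2.5):
    """Check that C + F + M is close to 100% (D is a subset of C)."""
    return abs(parsed["C"] + parsed["F"] + parsed["M"] - 100.0) <= tolerance


row = parse_busco_notation("C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023")
print(row, sums_to_100(row))
```

Such a check flags any row whose percentages do not add up, which can help catch transcription errors in tabulated results.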
Table S1. Current assessment completeness metrics in BUSCO notation (C:complete [D:duplicated],
F:fragmented, M:missed, n:genes) sampling different types of data and a variety of eukaryotic species.
Lineage Species Sample type Identifier N50 (Kbp) BUSCOs assessment
Vertebrates
Homo sapiens Genome GCA_000001405.15 67,794 C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023
Gene set GRCh37.75 C:99% [D:1.7%], F:0.0%, M:0.0%, n:3023
Mus musculus Genome GCA_000001635.4 52,589 C:78% [D:3.0%], F:19%, M:2.5%, n:3023
Gene set GRCm38.75 C:99% [D:2.5%], F:99%, M:0.1%, n:3023
Ornithorhynchus anatinus Genome GCF_000002275.2 991 C:55% [D:0.8%], F:25%, M:18%, n:3023
Gene set OANA5.75 C:72% [D:1.1%], F:19%, M:8.2%, n:3023
Callithrix jacchus
Gene set C_jacchus3.2.1.75 C:97% [D:2.9%], F:1.7%, M:0.8%, n:3023
Transcriptome GI:532219616 Bladder C:76% [D:17%], F:5.5%, M:18%, n:3023
Transcriptome GI:532292355 hippocampus C:79% [D:18%], F:4.5%, M:15%, n:3023
Transcriptome GI:532349506 Cortex C:34% [D:7.6%], F:34%, M:64%, n:3023
Transcriptome GI:532452938 S. muscle C:69% [D:13%], F:6.0%, M:24%, n:3023
Transcriptome GI:532524775 Cerebellum C:76% [D:19%], F:5.1%, M:18%, n:3023
Pan troglodytes
Gene set CHIMP2.14.75 C:96% [D:0.5%], F:1.2%, M:1.9%, n:3023
Transcriptome GI:410228237 adipose SC C:75% [D:15%], F:3.8%, M:20%, n:3023
Transcriptome GI:410308999 Fibroblast C:75% [D:16%], F:3.7%, M:21%, n:3023
Transcriptome GI:410268357 Endothelium C:75% [D:15%], F:3.5%, M:21%, n:3023
Anolis carolinensis
Gene set AnoCar2.0.75 C:89% [D:2.6%], F:6.8%, M:3.4%, n:3023
Transcriptome GI:614142443 Skeletal C:58% [D:14%], F:8.7%, M:32%, n:3023
Transcriptome GI:464801713 Whole C:27% [D:15%], F:18%, M:53%, n:3023
Latimeria chalumnae Transcriptome GI:387756559 Muscle C:37% [D:6.9%], F:11%, M:50%, n:3023
Rana clamitans Transcriptome GI:451274083 Unknown C:21% [D:0.3%], F:13%, M:65%, n:3023
Pseudacris regilla Transcriptome GI:451272305 Unknown C:20% [D:0.4%], F:16%, M:63%, n:3023
Salmo salar Transcriptome GI:666988260 Mixed C:19% [D:7.8%], F:6.6%, M:74%, n:3023
Oreochromis niloticus Transcriptome GI:555682626 Spleen C:39% [D:0.4%], F:16%, M:44%, n:3023
Ameiurus nebulosus Transcriptome GI:472819489 Unknown C:7.3% [D:0.2%], F:10%, M:82%, n:3023
Ursus maritimus Transcriptome GI:510063642 Fat C:50% [D:29%], F:5.5%, M:44%, n:3023
Tripterygion delaisi Transcriptome GI:572723144 Brain C:35% [D:13%], F:17%, M:47%, n:3023
Atractaspis aterrima Transcriptome GI:673456880 Venom C:0.7% [D:0.0%], F:1.0%, M:98%, n:3023
Transcriptome GI:673404158 Venom C:4.4% [D:0.5%], F:6.8%, M:88%, n:3023
Latimeria menadoensis Transcriptome GI:559559797 Testis C:71% [D:15%], F:6.5%, M:22%, n:3023
Hynobius chinensis Transcriptome GI:570932341 Unknown C:59% [D:7.3%], F:13%, M:26%, n:3023
Carduelis chloris Transcriptome GI:617996660 Blood C:31% [D:0.2%], F:12%, M:55%, n:3023
Maylandia zebra Transcriptome GI:614241491 Kidney C:64% [D:15%], F:8.7%, M:26%, n:3023
Chinchilla lanigera Transcriptome GI:618625375 Trachea C:80% [D:44%], F:5.7%, M:14%, n:3023
Ailuropoda melanoleuca Gene set ailMel1.75 C:97% [D:1.3%], F:1.8%, M:0.3%, n:3023
Bos taurus Gene set UMD3.1.75 C:97% [D:1.3%], F:1.6%, M:0.5%, n:3023
Danio rerio Gene set Zv9.75 C:95% [D:8.3%], F:3.2%, M:1.7%, n:3023
Felis catus Gene set Felis_catus_6.2.75 C:96% [D:1.2%], F:2.8%, M:0.5%, n:3023
Ficedula albicollis Gene set FicAlb_1.4.75 C:88% [D:2.0%], F:4.1%, M:7.8%, n:3023
Gallus gallus Gene set Galga4.75 C:90% [D:2.4%], F:3.5%, M:6.0%, n:3023
Gorilla gorilla Gene set gorGor3.1.75 C:96% [D:2.6%], F:1.7%, M:2.1%, n:3023
Loxodonta africana Gene set loxAfr3.75 C:96% [D:1.5%], F:2.3%, M:1.0%, n:3023
Macaca mulatta Gene set MMUL_1.75 C:94% [D:2.0%], F:4.5%, M:0.9%, n:3023
Monodelphis domestica Gene set BROADO5.75 C:95% [D:4.0%], F:2.3%, M:1.6%, n:3023
Mustela putorius Gene set MusPutFur1.0.75 C:97% [D:1.4%], F:1.7%, M:1.0%, n:3023
Oreochromis niloticus Gene set Orenil1.0.75 C:96% [D:5.1%], F:1.4%, M:2.5%, n:3023
Oryctolagus cuniculus Gene set OryCun2.0.75 C:93% [D:2.7%], F:3.0%, M:3.2%, n:3023
Oryzias latipes Gene set MEDAKA1.75 C:83% [D:3.2%], F:5.4%, M:11%, n:3023
Pongo abelii Gene set PPYG2.75 C:95% [D:1.1%], F:3.3%, M:1.1%, n:3023
Sus scrofa Gene set Sscrofa10.2.75 C:83% [D:7.4%], F:6.8%, M:10%, n:3023
Taeniopygia guttata Gene set taeGut3.2.4.75 C:81% [D:3.2%], F:7.5%, M:11%, n:3023
Takifugu rubripes Gene set FUGU4.75 C:89% [D:5.2%], F:3.5%, M:7.3%, n:3023
Xenopus tropicalis Gene set JGI_4.2.75 C:93% [D:3.4%], F:3.5%, M:2.5%, n:3023
Xiphophorus maculatus Gene set Xipmac4.4.2.75 C:93% [D:3.6%], F:4.7%, M:1.3%, n:3023
Arthropods
Acromyrmex echinatior Genome Aech_2.0 1,110 C:91% [D:2.6%], F:8.0%, M:0.6%, n:2675
Gene set Aech_OGS_v3.8 C:96% [D:8.8%], F:2.8%, M:0.5%, n:2675
Acyrtosiphon pisum Genome GCA_000142985.2 86 C:72% [D:6.1%], F:15%, M:12%, n:2675
Gene set GCA_000142985.2.22 C:89% [D:14%], F:4.1%, M:5.9%, n:2675
Aedes aegypti Genome AaegL3 1,547 C:86% [D:13%], F:10%, M:3.2%, n:2675
Gene set AaegL3.2 C:93% [D:17%], F:3.6%, M:3.0%, n:2675
Anopheles gambiae Genome AgamP4 49,364 C:93% [D:4.7%], F:4.1%, M:2.5%, n:2675
Gene set AgamP4.2 C:97% [D:10%], F:1.4%, M:0.8%, n:2675
Apis mellifera Genome Amel_v4.5 997 C:93% [D:2.9%], F:5.1%, M:0.9%, n:2675
Gene set Amel_OGS_v3.2 C:97% [D:9%], F:2.1%, M:0.1%, n:2675
Atta cephalotes Genome Acep 1.0 5,154 C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675
Gene set Acep OGS v1.2 C:91% [D:7.7%], F:7.5%, M:0.5%, n:2675
Bombyx mori Genome GCA_000151625.1 4,008 C:73% [D:2.2%], F:17%, M:8.3%, n:2675
Gene set GLEAN set C:75% [D:7.0%], F:14%, M:10%, n:2675
Camponotus floridanus Genome Cflor_v3.3 451 C:92% [D:3.1%], F:6.6%, M:0.5%, n:2675
Gene set Cflor_OGS_v3.3 C:95% [D:8.7%], F:3.9%, M:0.4%, n:2675
Danaus plexippus Genome DanPle_1.0.22 52 C:83% [D:8.6%], F:11%, M:4.3%, n:2675
Gene set DanPle_1.0.22 C:86% [D:9.0%], F:9.5%, M:3.7%, n:2675
Daphnia pulex Genome GCA_000187875.1 642 C:83% [D:3.9%], F:11%, M:5.1%, n:2675
Gene set GCA_000187875.1.22 C:84% [D:10%], F:11%, M:4.0%, n:2675
Dendroctonus ponderosae Genome GCA_000355655.2 628 C:77% [D:6.1%], F:15%, M:7.2%, n:2675
Gene set GCA_000355655.2.22 C:82% [D:11%], F:10%, M:6.6%, n:2675
Drosophila ananassae Genome Dana_r1.3 4,599 C:96% [D:3.7%], F:1.9%, M:1.9%, n:2675
Gene set Dana_r1.3 C:98% [D:9.6%], F:0.8%, M:0.1%, n:2675
Drosophila erecta Genome Dere_r1.3 18,748 C:98% [D:4.7%], F:1.4%, M:0.4%, n:2675
Gene set Dere_r1.3 C:99% [D:9.3%], F:0.2%, M:0.1%, n:2675
Drosophila grimshawi Genome Dgri_r1.3 8,399 C:97% [D:6.2%], F:2.2%, M:0.4%, n:2675
Gene set Dgri_r1.3 C:99% [D:11%], F:0.4%, M:0.0%, n:2675
Drosophila melanogaster Genome Dmel_r5.55 23,011 C:98% [D:6.4%], F:0.6%, M:0.3%, n:2675
Gene set Dmel_r5.55 C:99% [D:9.1%], F:0.2%, M:0.0%, n:2675
Drosophila mojavensis Genome Dmoj_r1.3 24,764 C:97% [D:4.4%], F:2.2%, M:0.4%, n:2675
Gene set Dmoj_r1.3 C:99% [D:9.6%], F:0.8%, M:0.1%, n:2675
Drosophila persimilis Genome Dper_r1.3 1,869 C:93% [D:5.6%], F:5.8%, M:0.8%, n:2675
Gene set Dper_r1.3 C:93% [D:9.3%], F:5.6%, M:0.7%, n:2675
Drosophila pseudoobscura Genome Dpse_r3.1 12,541 C:96% [D:6.3%], F:2.2%, M:1.1%, n:2675
Gene set Dpse_r3.1 C:98% [D:11%], F:0.6%, M:0.6%, n:2675
Drosophila sechellia Genome Dsec_r1.3 2,123 C:96% [D:5.1%], F:2.8%, M:0.7%, n:2675
Gene set Dsec_r1.3 C:96% [D:8.9%], F:3.0%, M:0.3%, n:2675
Drosophila simulans Genome Dsim_r1.4 857 C:85% [D:4.6%], F:9.0%, M:5.0%, n:2675
Gene set Dsim_r1.4 C:84% [D:7.6%], F:6.9%, M:8.0%, n:2675
Drosophila virilis Genome Dvir_r1.2 10,161 C:96% [D:5.2%], F:2.4%, M:0.6%, n:2675
Gene set Dvir_r1.2 C:99% [D:9.6%], F:0.7%, M:0.1%, n:2675
Drosophila willistoni Genome Dwil_r1.3 4,511 C:97% [D:5.5%], F:1.7%, M:0.4%, n:2675
Gene set Dwil_r1.3 C:99% [D:10%], F:0.6%, M:0.2%, n:2675
Drosophila yakuba Genome Dyak_r1.3 21,770 C:97% [D:6.5%], F:1.5%, M:0.7%, n:2675
Gene set Dyak_r1.3 C:98% [D:10%], F:0.8%, M:0.2%, n:2675
Harpegnathos saltator Genome Hsal_v3.3 601 C:89% [D:3.2%], F:9.6%, M:1.1%, n:2675
Gene set Hsal_OGS_v3.3 C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675
Heliconius melpomene Genome Hmel_v1.22 194 C:77% [D:2.0%], F:11%, M:10%, n:2675
Gene set Hmel_v1.22 C:74% [D:6.7%], F:14%, M:11%, n:2675
Ixodes scapularis Genome IscaW1 76 C:58% [D:1.7%], F:21%, M:19%, n:2675
Gene set IscaW1.3 C:69% [D:6.6%], F:23%, M:7.1%, n:2675
Linepithema humile Genome Lhum_v1.0 1,402 C:92% [D:3.3%], F:7.0%, M:0.6%, n:2675
Gene set Lhum_OGS_v1.2 C:95% [D:8.8%], F:4.0%, M:0.1%, n:2675
Lutzomyia longipalpis Genome Llonj1.1 85 C:73% [D:6.3%], F:10%, M:16%, n:2675
Gene set Llonj1.1 C:66% [D:9.7%], F:13%, M:20%, n:2675
Manduca sexta Genome GCA_000262585.1 664 C:81% [D:4.4%], F:12%, M:6.1%, n:2675
Gene set OGS2_20140407 C:80% [D:10%], F:10%, M:8.2%, n:2675
Megaselia scalaris Genome Mscal_v1.22 1 C:16% [D:0.6%], F:21%, M:61%, n:2675
Gene set Mscal_v1.22 C:21% [D:1.4%], F:20%, M:58%, n:2675
Metaseiulus occidentalis Genome Mocc_1.0 896 C:76% [D:4.9%], F:12%, M:10%, n:2675
Gene set Mocc_1.0 C:82% [D:14%], F:10%, M:6.5%, n:2675
Musca domestica Genome v2.0.2 226 C:91% [D:4.3%], F:5.3%, M:2.7%, n:2675
Gene set v2.0.2 C:97% [D:29%], F:2.3%, M:0.5%, n:2675
Nasonia vitripennis Genome Nvit_v1.0 698 C:91% [D:6.0%], F:5.1%, M:3.2%, n:2675
Gene set Nvit_OGS_v1.2 C:94% [D:10%], F:4.0%, M:1.1%, n:2675
Pediculus humanus Genome PhumU2 497 C:92% [D:3.9%], F:6.1%, M:1.6%, n:2675
Gene set PhumU2.1 C:93% [D:9.1%], F:4.9%, M:1.3%, n:2675
Phlebotomus papatasi Genome Ppapi1.1 0.87 C:33% [D:3.2%], F:33%, M:33%, n:2675
Gene set Ppapi1.1 C:54% [D:6.1%], F:20%, M:25%, n:2675
Pogonomyrmex barbatus Genome Pbar_v1.0 819 C:90% [D:2.9%], F:8.5%, M:0.7%, n:2675
Gene set Pbar_OGS_v1.2 C:93% [D:8.2%], F:6.5%, M:0.3%, n:2675
Solenopsis invicta Genome Sinv_v1.0 558 C:74% [D:2.4%], F:19%, M:6.3%, n:2675
Gene set Sinv_OGS_v2.2.3 C:80% [D:6.5%], F:14%, M:5.4%, n:2675
Rhodnius prolixus Genome RproC1 847 C:85% [D:2.5%], F:12%, M:2.5%, n:2675
Gene set RprocC1.2 C:74% [D:8.3%], F:9.1%, M:16%, n:2675
Strigamia maritima Genome Smar1.22 139 C:84% [D:5.9%], F:12%, M:3.2%, n:2675
Gene set GCA_000239435.1.22 C:87% [D:12%], F:8.3%, M:4.6%, n:2675
Tetranychus urticae Genome GCA_000239435.1 2,993 C:61% [D:4.5%], F:12%, M:25%, n:2675
Gene set GCA_000239435.1.22 C:69% [D:11%], F:9.6%, M:20%, n:2675
Tribolium castaneum Genome Tcas3.22 19,135 C:95% [D:5.8%], F:3.9%, M:0.8%, n:2675
Gene set Tcas_OGS_v2 C:95% [D:10%], F:3.0%, M:1.3%, n:2675
Acanthoscurria geniculata Transcriptome GI:598795695 whole C:65% [D:n.a.], F:13%, M:20%, n:2675
Anopheles sinensis Transcriptome GI:656597267 unknown C:36% [D:n.a.], F:22%, M:41%, n:2675
Anthonomus grandis Transcriptome GI:562777735 whole C:18% [D:n.a.], F:16%, M:65%, n:2675
Bactrocera dorsalis Transcriptome GI:618068638 unknown C:87% [D:n.a.], F:5.9%, M:6.4%, n:2675
Belgica antarctica Transcriptome GI:418280542 whole C:79% [D:n.a.], F:10%, M:9.8%, n:2675
Calanus finmarchicus Transcriptome GI:592958556 unknown C:84% [D:n.a.], F:7.3%, M:8.5%, n:2675
Transcriptome GI:647215886 unknown C:78% [D:n.a.], F:11%, M:10%, n:2675
Ceratitis capitata Transcriptome GI:577749858 whole C:87% [D:n.a.], F:7.3%, M:5.6%, n:2675
Cherax quadricarinatus Transcriptome GI:512174511 hypodermis C:7.8% [D:n.a.], F:7.6%, M:84%, n:2675
Corydalinae sp. Transcriptome GI:661070030 whole C:14% [D:n.a.], F:20%, M:64%, n:2675
Delia antiqua Transcriptome GI:604701913 whole C:55% [D:n.a.], F:15%, M:28%, n:2675
Dendroctonus frontalis Transcriptome GI:452943093 whole C:56% [D:n.a.], F:22%, M:21%, n:2675
Drosophila ercepeae Transcriptome GI:570540147 unknown C:18% [D:n.a.], F:16%, M:65%, n:2675
Drosophila malerkotliana m. Transcriptome GI:570549742 unknown C:19% [D:n.a.], F:16%, M:64%, n:2675
Drosophila malerkotliana p. Transcriptome GI:570523813 unknown C:29% [D:n.a.], F:24%, M:45%, n:2675
Drosophila merina Transcriptome GI:570504412 unknown C:25% [D:n.a.], F:20%, M:53%, n:2675
Drosophila miranda Transcriptome GI:645592147 unknown C:91% [D:n.a.], F:4.2%, M:4.0%, n:2675
Drosophila pseudoananassae n. Transcriptome GI:570451470 unknown C:6.2% [D:n.a.], F:21%, M:72%, n:2675
Drosophila pseudoananassae p. Transcriptome GI:570485056 whole C:8.5% [D:n.a.], F:21%, M:70%, n:2675
Drosophila serrata Transcriptome GI:480512000 unknown C:40% [D:n.a.], F:22%, M:36%, n:2675
Echinogammarus veneris Transcriptome GI:595402945 unknown C:20% [D:n.a.], F:8.0%, M:71%, n:2675
Enallagma hageni Transcriptome GI:459275420 total C:6.9% [D:n.a.], F:7.6%, M:85%, n:2675
Folsomia candida Transcriptome GI:570625125 unknown C:47% [D:n.a.], F:14%, M:38%, n:2675
Hyalella azteca Transcriptome GI:510074665 unknown C:5.9% [D:n.a.], F:3.8%, M:90%, n:2675
Transcriptome GI:510092454 unknown C:6.6% [D:n.a.], F:5.4%, M:87%, n:2675
Ips typographus Transcriptome GI:459277393 antenna C:19% [D:n.a.], F:20%, M:59%, n:2675
Ixodes scapularis Transcriptome GI:604952323 Synganglion C:27% [D:n.a.], F:26%, M:46%, n:2675
Ixodes ricinus Transcriptome GI:556088131 salivary C:77% [D:n.a.], F:8.4%, M:13%, n:2675
Latrodectus hesperus Transcriptome GI:618730332 unknown C:82% [D:n.a.], F:8.4%, M:9.3%, n:2675
Melita plumosa Transcriptome GI:510208131 whole C:6.4% [D:n.a.], F:6.3%, M:87%, n:2675
Mengenilla moldrzyki Transcriptome GI:660742704 whole C:9.5% [D:n.a.], F:13%, M:76%, n:2675
Musca domestica Transcriptome GI:604923024 unknown C:64% [D:n.a.], F:19%, M:15%, n:2675
Nannochorista philpotti Transcriptome GI:661012745 whole C:31% [D:n.a.], F:31%, M:37%, n:2675
Nilaparvata lugens Transcriptome GI:672467144 salivary C:74% [D:n.a.], F:12%, M:12%, n:2675
Orchesella cincta Transcriptome GI:570587022 unknown C:44% [D:n.a.], F:11%, M:44%, n:2675
Polistes canadensis Transcriptome GI:452055806 multiple C:51% [D:n.a.], F:22%, M:26%, n:2675
Pontastacus leptodactylus Transcriptome GI:556694752 hypodermis C:73% [D:n.a.], F:11%, M:14%, n:2675
Transcriptome GI:557011125 hepatopancreas C:44% [D:n.a.], F:12%, M:43%, n:2675
Priacma serrata Transcriptome GI:661240973 Unknown C:11% [D:n.a.], F:16%, M:72%, n:2675
Spodoptera exigua Transcriptome GI:548816146 unknown C:29% [D:n.a.], F:14%, M:55%, n:2675
Stegodyphus mimosarum Transcriptome GI:598904898 whole C:14% [D:n.a.], F:16%, M:68%, n:2675
Teleopsis dalmanni Transcriptome GI:615270444 whole C:92% [D:n.a.], F:6.0%, M:1.6%, n:2675
Teleopsis whitei Transcriptome GI:619803922 whole C:90% [D:n.a.], F:4.6%, M:5.3%, n:2675
Themira biloba Transcriptome GI:654236640 wildtype C:71% [D:n.a.], F:16%, M:11%, n:2675
Other metazoans
Brugia malayi Genome GCA_000002995.3 37 C:60% [D:1.5%], F:13%, M:25%, n:843
Gene set B_malayi_3.0.22 C:77% [D:9.7%], F:5.1%, M:17%, n:843
Caenorhabditis briggsae Genome CB4 17,512 C:76% [D:2.9%], F:7.5%, M:16%, n:843
Gene set CB4.22 C:85% [D:11%], F:3.5%, M:11%, n:843
Caenorhabditis elegans Genome GCA_000002985.3 17,494 C:85% [D:6.9%], F:2.8%, M:11%, n:843
Gene set WBcel235.22 C:90% [D:11%], F:1.7%, M:7.5%, n:843
Caenorhabditis japonica Genome GCA_000147155.1 94 C:63% [D:4.8%], F:13%, M:22%, n:843
Gene set C_japonica-7.0.1.22 C:67% [D:9.4%], F:11%, M:20%, n:843
Helobdella robusta Genome GCA_000326865.1 3,060 C:74% [D:3.4%], F:10%, M:14%, n:843
Gene set GCA_000326865.1.22 C:85% [D:12%], F:9.9%, M:4.2%, n:843
Loa loa Genome GCA_00018385.2 174 C:80% [D:6.6%], F:2.4%, M:17%, n:843
Gene set Loa_loa_v3.22 C:81% [D:8.5%], F:4.5%, M:14%, n:843
Lottia gigantea Genome GCA_00032785.1 1,870 C:89% [D:2.3%], F:4.3%, M:5.8%, n:843
Gene set GCA_00032785.1.22 C:90% [D:13%], F:7.8%, M:2.1%, n:843
Nematostella vectensis Genome GCA_000209225.1 472 C:78% [D:3.5%], F:10%, M:10%, n:843
Gene set GCA_000209225.1.22 C:83% [D:15%], F:14%, M:2.8%, n:843
Schistosoma mansoni Genome GCA_000237925.2 34,464 C:56% [D:4.3%], F:8.3%, M:34%, n:843
Gene set ASM2379v2.22 C:65% [D:7.8%], F:8.3%, M:26%, n:843
Strongylocentrotus purpuratus Genome GCA_000002235.2 167 C:87% [D:6.5%], F:7.8%, M:4.9%, n:843
Gene set GCA_000002235.2.22 C:83% [D:19%], F:15%, M:0.7%, n:843
Trichoplax adhaerens Genome GCA_000150275.1 5,978 C:81% [D:1.1%], F:7.8%, M:10%, n:843
Gene set ASM1507v1.22 C:85% [D:11%], F:12%, M:2.3%, n:843
Ancylostoma ceylanicum Transcriptome GI:595744344 Unknown C:16% [D:n.a.], F:38%, M:44%, n:843
Aplysia californica
Transcriptome GI:613602134 chemokine C:88% [D:n.a.], F:8.1%, M:2.8%, n:843
Transcriptome GI:614063388 Gills C:88% [D:n.a.], F:8.4%, M:3.5%, n:843
Transcriptome GI:606015213 Heart C:77% [D:n.a.], F:12%, M:9.3%, n:843
Transcriptome GI:594457164 Salivary C:41% [D:n.a.], F:23%, M:34%, n:843
Apostichopus japonicus Transcriptome GI:638469663 Unknown C:68% [D:n.a.], F:24%, M:6.9%, n:843
Asterias amurensis Transcriptome GI:638532954 Unknown C:59% [D:n.a.], F:28%, M:11%, n:843
Bithynia siamensis goniomphalos Transcriptome GI:480970007 Unknown C:57% [D:n.a.], F:24%, M:17%, n:843
Evechinus chloroticus Transcriptome GI:559461775 Unknown C:92% [D:n.a.], F:5.3%, M:2.6%, n:843
Henricia sp. AR-2014 Transcriptome GI:638872012 Unknown C:90% [D:n.a.], F:7.9%, M:1.1%, n:843
Patiria miniata Transcriptome GI:638728087 Ovary C:88% [D:n.a.], F:10%, M:1.1%, n:843
Patiria pectinifera Transcriptome GI:638651248 Unknown C:80% [D:n.a.], F:18%, M:1.6%, n:843
Procotyla fluviatilis Transcriptome GI:528026207 Unknown C:54% [D:n.a.], F:18%, M:26%, n:843
Fungi
Ashbya gossypii Genome GCA_000091025.4 1,476 C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438
Gene set C:95% [D:7.3%], F:3.8%, M:0.9%, n:1438
Aspergillus nidulans Genome GCA_000011425.1 3,704 C:98% [D:1.8%], F:0.9%, M:0.2%, n:1438
Gene set C:95% [D:11%], F:2.8%, M:1.8%, n:1438
Cryptococcus neoformans Genome GCA_000091045.1 1,438 C:92% [D:5.4%], F:2.5%, M:4.8%, n:1438
Gene set C:90% [D:7.1%], F:5.9%, M:3.1%, n:1438
Gibberella zeae Genome GCA_000240135.2 5,350 C:98% [D:1.3%], F:1.3%, M:0.2%, n:1384
Gene set C:97% [D:11%], F:2.0%, M:0.2%, n:1384
Komagataella pastoris Genome GCA_000027005.1 2,394 C:93% [D:5.0%], F:4.5%, M:2.0%, n:1438
Gene set C:93% [D:8.5%], F:3.8%, M:2.7%, n:1438
Neurospora crassa Genome GCA_000182925.1 6,000 C:98% [D:6.5%], F:0.6%, M:0.6%, n:1438
Gene set C:97% [D:10%], F:1.5%, M:0.6%, n:1438
Phaeosphaeria nodorum Genome GCA_000146915.1 1,045 C:96% [D:6.0%], F:3.1%, M:0.2%, n:1438
Gene set C:91% [D:9.7%], F:8.4%, M:0.4%, n:1438
Puccinia graminis Genome GCA_000149925.1 964 C:63% [D:5.6%], F:20%, M:15%, n:1438
Gene set C:85% [D:11%], F:8.0%, M:6.3%, n:1438
Saccharomyces cerevisiae Genome GCA_000146045.2 924 C:96% [D:5.2%], F:0.4%, M:2.7%, n:1438
Gene set C:98% [D:8.6%], F:1.1%, M:0%, n:1438
Schizosaccharomyces pombe Genome GCA_000002945.2 4,539 C:89% [D:3.8%], F:2.7%, M:7.7%, n:1438
Gene set C:90% [D:9.5%], F:5.7%, M:3.3%, n:1438
Sclerotinia sclerotiorum Genome GCA_000146945.1 1,625 C:70% [D:3.5%], F:3.8%, M:25%, n:1438
Gene set C:67% [D:8%], F:7.4%, M:25%, n:1438
Tuber melanosporum Genome GCA_000151645.1 638 C:95% [D:5.0%], F:4.1%, M:0.6%, n:1438
Gene set C:91% [D:9.0%], F:6.2%, M:2.3%, n:1438
Ustilago maydis Genome GCA_000328475.1 127 C:92% [D:5.9%], F:3.1%, M:4.4%, n:1438
Gene set C:88% [D:7.5%], F:6.6%, M:5.0%, n:1438
Verticillium dahliae Genome GCA_000150675.1 1,273 C:95% [D:4.4%], F:3.5%, M:0.9%, n:1438
Gene set C:94% [D:9.4%], F:4.5%, M:0.9%, n:1438
Yarrowia lipolytica Genome GCA_000002525.1 3,633 C:97% [D:5.4%], F:2.1%, M:0.6%, n:1438
Gene set C:96% [D:8.8%], F:2.9%, M:0.6%, n:1438
Agaricus subrufescens Transcriptome GI:645683639 Unknown C:7.7% [D:n.a.], F:28%, M:63%, n:1438
Armillaria ostoyae Transcriptome GI:480500433 RNA1 C:45% [D:n.a.], F:42%, M:11%, n:1438
Hypsizygus marmoreus Transcriptome GI:612225315 Unknown C:59% [D:n.a.], F:34%, M:6.4%, n:1138
Ophiocordyceps sinensis Transcriptome GI:630075070 Unknown C:38% [D:n.a.], F:36%, M:24%, n:1438
Phakopsora pachyrhizi Transcriptome GI:452772923 Thai1 C:9.3% [D:n.a.], F:12%, M:78%, n:1438
Puccinia striiformis f.sp. tritici
Transcriptome GI:509494464 PST C:32% [D:n.a.], F:35%, M:32%, n:1438
Transcriptome GI:509507311 Haustorium C:22% [D:n.a.], F:33%, M:43%, n:1438
Transcriptome GI:509515198 Spore C:17% [D:n.a.], F:32%, M:49%, n:1438
Pyrenochaeta lycopersici Transcriptome GI:589143963 unknown C:94% [D:n.a.], F:4.8%, M:0.1%, n:1438
Spraguea lophii Transcriptome GI:520759716 Spore C:6.4% [D:n.a.], F:11%, M:82%, n:1438
Termitomyces clypeatus Transcriptome GI:595370870 treated C:95% [D:n.a.], F:4.3%, M:0.0%, n:1438
Transcriptome GI:595351039 untreated C:91% [D:n.a.], F:7.5%, M:1.1%, n:1438
Trametes sanguinea Transcriptome GI:511189810 BAFC2126 C:18% [D:n.a.], F:30%, M:50%, n:1438
Uromyces appendiculatus Transcriptome GI:452898896 SWBR1 C:34% [D:n.a.], F:25%, M:39%, n:1438
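The assessment column of the table above uses a compact notation (C, D, F, M, n) that is easy to read programmatically. The following is a minimal sketch, not part of the BUSCO software: the function name `parse_busco_summary` and the returned dictionary keys are hypothetical, and the bracketed D value is assumed to be reported either as a percentage or as `n.a.` (as it is for transcriptomes in the table).

```python
import re
from typing import Dict, Optional

# Matches the summary notation used in the table, for example
# "C:74% [D:2.4%], F:19%, M:6.3%, n:2675" or "C:65% [D:n.a.], F:13%, M:20%, n:2675".
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[D:(?P<D>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco_summary(text: str) -> Dict[str, Optional[float]]:
    """Extract the percentages and the BUSCO count from one table row.

    Hypothetical helper for illustration; key names are our own choice.
    """
    match = BUSCO_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no BUSCO summary found in: {text!r}")
    duplicated = match.group("D")
    return {
        "complete": float(match.group("C")),
        # "n.a." (no duplicated value reported) is mapped to None.
        "duplicated": None if duplicated == "n.a." else float(duplicated.rstrip("%")),
        "fragmented": float(match.group("F")),
        "missing": float(match.group("M")),
        "n": int(match.group("n")),
    }


if __name__ == "__main__":
    row = "Solenopsis invicta Genome Sinv_v1.0 558 C:74% [D:2.4%], F:19%, M:6.3%, n:2675"
    print(parse_busco_summary(row))
```

Because the pattern is searched for anywhere in the line, whole table rows can be passed in without first stripping the species, sample type, or identifier fields.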
5. BUSCO and CEGMA analysis run-times
The total run-times of default-parameter BUSCO and CEGMA assessments of genome assemblies and
gene sets were evaluated on representative species from different metazoan lineages (Table
S2). All analyses were performed using 4 CPUs with up to 8 GB of RAM. BUSCO assessments were
performed using the eukaryote and metazoan sets, as well as the largest specific set for each species.
Table S2. BUSCO and CEGMA assessment run-times on four representative species.
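Species                   Dataset              Analysis                   Run-time
Drosophila melanogaster   Genome, 180 Mbp      2’675 arthropod BUSCOs     7.6h
Drosophila melanogaster   Genome, 180 Mbp      843 metazoan BUSCOs        3.2h
Drosophila melanogaster   Genome, 180 Mbp      429 eukaryote BUSCOs       1.4h
Drosophila melanogaster   Genome, 180 Mbp      250 eukaryote BUSCOs       0.81h
Drosophila melanogaster   Genome, 180 Mbp      248 CEGMA genes            2.5h
Drosophila melanogaster   Gene set, 13’918     2’675 arthropod BUSCOs     1.4h
Drosophila melanogaster   Gene set, 13’918     843 metazoan BUSCOs        0.5h
Drosophila melanogaster   Gene set, 13’918     429 eukaryote BUSCOs       0.36h
Drosophila melanogaster   Gene set, 13’918     250 eukaryote BUSCOs       0.15h
Drosophila melanogaster   Gene set, 13’918     248 CEGMA genes            N/A
Heliconius melpomene      Genome, 269 Mbp      2’675 arthropod BUSCOs     8.1h
Heliconius melpomene      Genome, 269 Mbp      843 metazoan BUSCOs        3.6h
Heliconius melpomene      Genome, 269 Mbp      429 eukaryote BUSCOs       0.91h
Heliconius melpomene      Genome, 269 Mbp      250 eukaryote BUSCOs       0.58h
Heliconius melpomene      Genome, 269 Mbp      248 CEGMA genes            5.7h
Heliconius melpomene      Gene set, 12’669     2’675 arthropod BUSCOs     0.35h
Heliconius melpomene      Gene set, 12’669     843 metazoan BUSCOs        0.18h
Heliconius melpomene      Gene set, 12’669     429 eukaryote BUSCOs       0.12h
Heliconius melpomene      Gene set, 12’669     250 eukaryote BUSCOs       0.1h
Heliconius melpomene      Gene set, 12’669     248 CEGMA genes            N/A
Homo sapiens              Genome, 3’381 Mbp    3’023 vertebrate BUSCOs    29h
Homo sapiens              Genome, 3’381 Mbp    843 metazoan BUSCOs        13h
Homo sapiens              Genome, 3’381 Mbp    429 eukaryote BUSCOs       6.5h
Homo sapiens              Genome, 3’381 Mbp    250 eukaryote BUSCOs       2.8h
Homo sapiens              Genome, 3’381 Mbp    248 CEGMA genes            25.3h
Homo sapiens              Gene set, 20’364     3’023 vertebrate BUSCOs    2.6h
Homo sapiens              Gene set, 20’364     843 metazoan BUSCOs        1.2h
Homo sapiens              Gene set, 20’364     429 eukaryote BUSCOs       0.5h
Homo sapiens              Gene set, 20’364     250 eukaryote BUSCOs       0.21h
Homo sapiens              Gene set, 20’364     248 CEGMA genes            N/A
Caenorhabditis elegans    Genome, 100 Mbp      843 metazoan BUSCOs        5.3h
Caenorhabditis elegans    Genome, 100 Mbp      429 eukaryote BUSCOs       1.36h
Caenorhabditis elegans    Genome, 100 Mbp      250 eukaryote BUSCOs       0.88h
Caenorhabditis elegans    Genome, 100 Mbp      248 CEGMA genes            1.7h
Caenorhabditis elegans    Gene set, 20’447     843 metazoan BUSCOs        0.5h
Caenorhabditis elegans    Gene set, 20’447     429 eukaryote BUSCOs       0.3h
Caenorhabditis elegans    Gene set, 20’447     250 eukaryote BUSCOs       0.1h
Caenorhabditis elegans    Gene set, 20’447     248 CEGMA genes            N/A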
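One way to read Table S2 is to compare genome run-times for the two gene sets of similar size, the 250 eukaryote BUSCOs and the 248 CEGMA genes. The sketch below does only that arithmetic; the run-times are copied from Table S2, while the dictionary layout and the function name `cegma_to_busco_ratio` are our own illustrative choices.

```python
from typing import Dict

# Genome assessment run-times in hours, copied from Table S2.
GENOME_RUNTIMES_H: Dict[str, Dict[str, float]] = {
    "Drosophila melanogaster": {"busco_250_eukaryote": 0.81, "cegma_248": 2.5},
    "Heliconius melpomene": {"busco_250_eukaryote": 0.58, "cegma_248": 5.7},
    "Homo sapiens": {"busco_250_eukaryote": 2.8, "cegma_248": 25.3},
    "Caenorhabditis elegans": {"busco_250_eukaryote": 0.88, "cegma_248": 1.7},
}


def cegma_to_busco_ratio(runtimes: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return, per species, the CEGMA run-time divided by the BUSCO run-time."""
    return {
        species: times["cegma_248"] / times["busco_250_eukaryote"]
        for species, times in runtimes.items()
    }


if __name__ == "__main__":
    for species, ratio in cegma_to_busco_ratio(GENOME_RUNTIMES_H).items():
        print(f"{species}: CEGMA took {ratio:.1f}x the BUSCO run-time")
```

A ratio above 1 means the CEGMA assessment took longer than the 250-BUSCO assessment on the same genome. Gene-set rows are left out because CEGMA run-times are listed as N/A there.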
6. References
Camacho, C., et al. (2009) BLAST+: architecture and applications, BMC Bioinformatics, 10, 421.
Eddy, S.R. (2011) Accelerated Profile HMM Searches, PLoS Comput Biol, 7, e1002195.
Keller, O., et al. (2011) A novel hybrid gene prediction method employing protein multiple sequence alignments,
Bioinformatics, 27, 757-763.
Kriventseva, E.V., et al. (2014) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free
software, Nucleic Acids Res.
Mende, D.R., et al. (2013) Accurate and universal delineation of prokaryotic species, Nat Methods, 10, 881-884.
Parra, G., Bradnam, K. and Korf, I. (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic
genomes, Bioinformatics, 23, 1061-1067.
Parra, G., et al. (2009) Assessing the gene space in draft genomes, Nucleic Acids Res, 37, 289-297.
Sievers, F. and Higgins, D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences,
Methods Mol Biol, 1079, 105-116.
Tatusov, R., et al. (2003) The COG database: an updated version includes eukaryotes., BMC Bioinformatics, 4, 41.
Waterhouse, R.M. (2015) A maturing understanding of the composition of the insect gene repertoire, Current
Opinion in Insect Science, 1.
Waterhouse, R.M., et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs, Nucleic
Acids Research, 41, D358-D365.
Waterhouse, R.M., Zdobnov, E.M. and Kriventseva, E.V. (2011) Correlating Traits of Gene Retention, Sequence
Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi, Genome Biology and
Evolution, 3, 75-86.