Supplemental Information

Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene re-annotation

Chengcheng Cai1, Xiaobo Wang1, Bo Liu, Jian Wu, Jianli Liang, Yinan Cui, Feng Cheng*, and Xiaowu Wang*

Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Zhongguancun Nandajie No.12, Haidian district, Beijing 100081, P.R. China.
1These authors contributed equally to this article.
*Corresponding authors.

Contents
Supplemental Methods
Supplemental Tables
Supplemental Figures
Supplemental References

Supplemental Methods

Genome de novo assembly
A total of 55 Gb (114×) of paired-end read data was generated from two libraries with an insert size of 250 bp and a read length of 150 bp. A further 8.7 Gb (18×) of read data was generated from two mate-paired libraries with insert sizes of 20 Kb and 40 Kb, respectively. All of these data were produced on the Illumina HiSeq 2500 sequencing platform. In addition, 6.5 Gb (13.4×) of single-molecule sequencing data (PacBio reads) was generated, with an average read length of 12 Kb. Besides these newly generated data, all datasets used to assemble B. rapa genome V1.5 (Wang et al., 2011) were also incorporated in this genome re-assembly.
Low-quality reads in the Illumina raw data were filtered out according to the following rules: 1) reads containing “N”; 2) reads whose average Phred-like score was less than 20; 3) reads shorter than 50 bp after trimming the 3′ terminal nucleotides whose Phred-like score was less than 13. Duplicated reads generated from one amplicon were removed to keep only one copy. Meanwhile, sequencing errors in the PacBio reads were corrected with the 150 bp paired-end Illumina reads using LoRDEC (Salmela and Rivals, 2014) with parameters “-k 19 -s 3”. The resulting clean reads were then used for genome assembly.
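The three read-filtering rules can be written as a small predicate. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, quality scores are assumed to be already decoded to integers, and computing the average score before trimming is an assumption, since the text does not state the order.

```python
from statistics import mean


def trim_3prime(seq, quals, min_q=13):
    """Drop trailing bases whose Phred-like score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq, quals, min_mean_q=20, min_len=50, trim_q=13):
    """Return True if a read passes the three filters described above."""
    # Rule 1: discard reads containing an ambiguous base.
    if "N" in seq.upper():
        return False
    # Rule 2: discard reads whose average score is below 20
    # (assumption: the average is taken over the untrimmed read).
    if not quals or mean(quals) < min_mean_q:
        return False
    # Rule 3: trim low-quality 3' bases, then require at least 50 bp.
    trimmed_seq, _ = trim_3prime(seq, quals, trim_q)
    return len(trimmed_seq) >= min_len
```

A read of 60 bases with uniformly high scores is kept, whereas the same read with 15 low-quality bases at its 3′ end is trimmed to 45 bp and discarded.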
Genome assembly proceeded in five steps. First, all the 150 bp paired-end reads (114×) were assembled into contigs by SOAPdenovo2 (Luo et al., 2012) with a k-mer size of 91 bp. Next, the large insert-size mate-paired libraries (>2 Kb) used in the previous assembly, plus the two newly generated libraries (20 Kb and 40 Kb), were used to link contigs into scaffolds using SSPACE (Boetzer et al., 2011). Then, publicly available BAC-end sequences were mapped to the scaffolds using BLAST, and the scaffolds were further linked into super-scaffolds according to the paired relationships of the BAC-end sequences. After that, all the paired-end Illumina reads from short insert-size libraries (<1 Kb) were used to close gaps in these super-scaffolds using GapCloser (Luo et al., 2012) with default parameters. Finally, the single-molecule PacBio reads were used to close remaining gaps in the super-scaffolds using PBjelly_V15.2.20 (English et al., 2012) with parameters “--minGap 1 -minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500 -nproc 5 -noSplitSubreads”.
The quality of genome V1.5 and V2.0 was validated through CEGMA analysis (Parra et al., 2007). The 458 core eukaryotic genes (CEG database) were BLASTed against each genome assembly, yielding hits for 455 (99.34%) and 454 (99.13%) of the 458 CEG proteins in genome V1.5 and V2.0, respectively. The genomes were also validated by matching B. rapa ESTs downloaded from NCBI, which showed that 99.16% and 99.34% of the ESTs were supported by the assembled genome V1.5 and V2.0, respectively.
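The CEGMA percentages follow directly from the hit counts. The short check below reproduces them; the function and variable names are illustrative only.

```python
def percent_recovered(hits, total):
    """Return the percentage of reference genes with a hit in the assembly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


TOTAL_CEGS = 458  # number of core eukaryotic genes stated in the text
cegma_hits = {"V1.5": 455, "V2.0": 454}

for version, hits in cegma_hits.items():
    pct = percent_recovered(hits, TOTAL_CEGS)
    print(f"{version}: {hits}/{TOTAL_CEGS} = {pct:.2f}%")
```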

Construction of pseudo-chromosomes
To correct mis-assembled scaffolds and assign the corrected scaffolds to the ten chromosomes of B. rapa, two genetic maps were constructed: one for a previously reported RILs (recombinant inbred lines) population (Yu et al., 2013) and one for an F1DH (F1 doubled haploid) population of B. rapa. The RILs population contains 150 recombinant inbred lines derived from a cross between a heading Chinese cabbage (B. rapa ssp. pekinensis cv Bre) and a non-heading B. rapa (B. rapa ssp. chinensis cv Wut) (Yu et al., 2013). Low-depth sequencing data of the RILs population were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). The F1DH population included 120 F1DH lines derived from a cross between a Chinese cabbage DH line Z16 (B. rapa ssp. pekinensis) and a rapid cycling line L144 (B. rapa ssp. oleifera). The F1DH population was previously used to develop a genetic map with InDel markers to construct pseudo-chromosomes for B. rapa genome V1.5 (Wang et al., 2011). In this work, SLAF-Seq (Sun et al., 2013) was performed on the 120 F1DH lines and their two parents.
Two high-density linkage maps were constructed with the following procedure. First, raw data were filtered to remove low-quality reads using the rules described above. Next, clean reads of the two parents of a population were aligned to the B. rapa reference genome (V2.0) using SOAPaligner (SOAP2) (Li et al., 2009b) with parameters “-m 100 -x 1000 -r 0”, and paired-end reads with unique hits to the reference were retained. SOAPsnp (Li et al., 2009a) was then used to call SNPs with parameter “-L 100”. Confident SNPs were selected using the following criteria: 1) the average quality score was over 20; 2) the locus was covered by at least three reads; 3) only homozygous genotypes were considered. SNP loci showing polymorphism between the two parents were kept for further analysis. Reads from each line of the population were then aligned to B. rapa genome V2.0 using SOAP2 (Li et al., 2009b) with the same parameters, and genotypes were identified for each line at the SNP loci that were polymorphic between the two parents. Ungenotyped loci were imputed using the k-NN (k nearest neighbors) algorithm (Larose, 2005). The parental origin of each SNP locus was then determined for each line of the population, and genomic regions showing complete linkage disequilibrium across the whole population were merged to form bin-markers. Finally, these bin-markers were submitted to JoinMap (version 4.0) (Van Ooijen, 2006) to construct the genetic map. This process was applied to both the RILs and F1DH populations to build the two genetic maps.
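To illustrate the bin-marker step, the sketch below merges consecutive SNPs that carry an identical parental-genotype pattern across all lines, which is one simple reading of “complete linkage disequilibrium in the whole population”. The data layout, the allele codes, and the function name are assumptions made for illustration; the text does not describe the actual implementation.

```python
from itertools import groupby


def build_bin_markers(snps):
    """Merge adjacent SNPs with identical genotype patterns into bins.

    snps: list of (chrom, pos, genotypes) tuples sorted by chrom and pos,
          where genotypes is a tuple holding one parental code per line.
    """
    bins = []
    # groupby only groups consecutive items, so a bin never spans a
    # recombination breakpoint or a chromosome boundary.
    for (chrom, genotypes), group in groupby(snps, key=lambda s: (s[0], s[2])):
        positions = [pos for _, pos, _ in group]
        bins.append(
            {
                "chrom": chrom,
                "start": positions[0],
                "end": positions[-1],
                "n_snps": len(positions),
                "genotypes": genotypes,
            }
        )
    return bins


# Toy input: three lines genotyped at four SNPs ("A"/"B" = parental origin).
toy_snps = [
    ("A01", 100, ("A", "B", "A")),
    ("A01", 250, ("A", "B", "A")),
    ("A01", 900, ("A", "B", "B")),
    ("A02", 50, ("A", "B", "B")),
]
toy_bins = build_bin_markers(toy_snps)
```

In the toy input, the first two SNPs collapse into one bin, while the third starts a new bin because the last line switches parental origin there.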
In total, 28 mis-assembled scaffolds were corrected with information from the two genetic maps. These 28 scaffolds were split, and their fragments were assigned to chromosomes in the same way as the other scaffolds. The RILs genetic map served as the principal evidence (because of its higher resequencing density) for ordering the assembled scaffolds. The genetic map built from SLAF-Seq of the F1DH population was used to assist the ordering and orientation of scaffolds when there was limited or no recombination information in the RILs map. The syntenic relationships between the genomes of B. rapa and A. thaliana or S. parvula were also used as evidence in determining the order and orientation of scaffolds in local regions.

Constructing transcripts from mRNA-Seq data
A total of 29 Gb of mRNA-Seq data was generated from eight tissues of B. rapa: aboveground stem, flower, small anther, middle anther, tender leaf, middle leaf, petiole, and seed pod. Detailed information on each tissue is listed in Supplemental Table 11. In addition to the newly generated data, two previously reported mRNA-Seq datasets were also incorporated. The first was used to verify gene models in the first release of the B. rapa genome (Wang et al., 2011). The second was used in a study of gene expression during leaf development of Chinese cabbage (Wang et al., 2012).

Both de novo and genome-guided approaches were used to assemble these large volumes of mRNA-Seq reads into transcripts. For the de novo approach, the Trinity (Grabherr et al., 2011) tool package was used with default parameters to assemble the mRNA-Seq reads. For the genome-guided approach, several utilities embedded in Trinity (alignReads.pl, prep_rnaseq_alignments_for_genome_assisted_assembly.pl, GG_write_trinity_cmds.pl, ParaFly, and GG_trinity_accession_incrementer.pl) were used to complete the alignment and assembly processes step by step.

Genome annotation
The gene prediction process consisted of the following steps: 1) ab initio gene modeling; 2) detection of homologous genes; 3) mapping of transcript fragments; and 4) merging of the three predictions. Repeat sequences were masked before gene annotation. In the first step, two gene predictors, Augustus and GlimmerHMM, were used for de novo gene prediction. From the output of both tools, predicted genes whose coding regions were shorter than 150 bp were filtered out. In the second step, protein sequences of A. thaliana (TAIR10) and C. rubella were collected. genBlastA (She et al., 2009) was used to align the A. thaliana and C. rubella protein sequences to the masked B. rapa genome with parameters “-e 1e-5”. The candidate gene sequences generated by genBlastA, along with the homologous proteins, were further processed by GeneWise (Birney et al., 2004) with default parameters to predict gene structure. In the third step, three kinds of evidence were used to perform transcript-assisted gene prediction: 1) a set of 214,482 B. rapa ESTs downloaded from NCBI; 2) de novo assembled transcripts; and 3) genome-guided assembled transcripts. These transcript-related sequences were aligned to the masked genome, and the results were processed by PASA (Haas et al., 2003) with default parameters.
In the final step, EVidenceModeler (EVM) (Haas et al., 2008) was used to merge all results from de novo prediction, homologous prediction, and PASA annotation into a weighted consensus gene set (weight values: 1 for de novo prediction, 5 for homologous prediction, and 10 for PASA annotation). The EVM results were further filtered to remove: 1) genes with protein-coding regions shorter than 150 bp; 2) incomplete genes without start and stop codons; and 3) genes with premature termination during translation. Finally, PASA was used to improve the EVM gene models by modifying exons, adding UTRs, and determining alternatively spliced isoforms. The predicted gene sets of V1.5 and V2.0 were further assessed and compared using BUSCO analysis (Simao et al., 2015) (Supplemental Table 12).
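The three post-EVM filters can be sketched as a check on each coding sequence. This is an illustrative reading rather than the authors' code: the function name is hypothetical, the standard stop codons are assumed, a gene lacking either a start or a stop codon is treated as incomplete, and a CDS whose length is not a multiple of three is also rejected (an added assumption).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_gene_filters(cds, min_len=150):
    """Return True if a coding sequence survives the three filters above."""
    cds = cds.upper()
    # Filter 1: protein-coding region shorter than 150 bp.
    if len(cds) < min_len:
        return False
    # Assumption: a complete CDS has a length divisible by three.
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Filter 2: incomplete gene (missing start or stop codon).
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Filter 3: premature termination (in-frame stop before the last codon).
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    return True
```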
After gene model prediction, gene function annotation was performed. InterProScan (Hunter et al., 2009) was used to annotate motifs and domains by comparing predicted genes with available databases such as PROSITE, PRINTS, Pfam, ProDom, and SMART. The GO annotation was extracted from the output of InterProScan. The predicted proteins were further aligned to the Swiss-Prot, TrEMBL, and KEGG databases using BLASTP with an E-value cutoff of 1×10⁻⁵ to obtain other annotation information. All four of these annotation datasets are freely available through: http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/.

Annotation of transposable elements and non-coding RNAs
Transposable elements (TEs) were annotated using RepeatMasker (Tarailo-Graovac and Chen, 2009). To compare features of TEs between B. rapa genome V1.5 and V2.0, TEs were re-annotated on both genome versions using the same methods described elsewhere (Wang and Cheng, 2016). For the TEs newly annotated in V2.0, two situations were observed: 1) extra TEs in V2.0 without counterparts in the corresponding regions of V1.5 (Supplemental Figure 7A); and 2) more TEs annotated in V2.0 than in the corresponding regions of V1.5 (Supplemental Figure 7B). Full-length LTRs were identified using LTR_Finder with default parameters (Xu and Wang, 2007).

tRNA sequences were identified from the structural characteristics of tRNA using tRNAscan-SE-1.23 (Lowe and Eddy, 1997). rRNA was located by aligning known full-length plant rRNAs to the B. rapa genome. snRNA sequences were predicted using Rfam-9.1 (Griffiths-Jones et al., 2005). miRNA was predicted using methods similar to those reported by Sun et al. (2015). In total, the annotated ncRNAs of all kinds accounted for ≈0.306% of the updated genome (Supplemental Table 13).

Sub-genomes reconstruction and analysis
Based on the syntenic relationship between B. rapa genome V2.0 and A. thaliana, the least fractionated (LF), the medium fractionated (MF1), and the most fractionated (MF2) sub-genomes of B. rapa V2.0 were built. Detailed information on paralogous genes in the three sub-genomes is listed in Supplemental Table 14. The LF sub-genome retained more gene copies than the MF1 and MF2 sub-genomes. Part of block D was found to be lost, which resulted in low gene density in the LF sub-genome at the beginning of ancestral chromosome tPCK7 (Supplemental Figure 8).

A total of 2,058 genes fully retained in all three sub-genomes were identified. After removing TA (tandem) genes and MT genes (one gene in B. rapa genome V1.5 or in A. thaliana, but multiple homologous genes in V2.0), genes from 1,307 fully retained homoeologs were chosen to analyze patterns of gene expression among sub-genomes using the same methods described elsewhere (Feng et al., 2012). Genes in sub-genome LF are dominantly expressed over genes in MF1 and MF2 (Supplemental Table 15).
The relationship between TEs targeted by 24-nucleotide small RNAs and the expression pattern of gene doublets was analyzed using methods similar to those reported previously (Cheng et al., 2016). The mRNA-Seq data of three organs (root, leaf, and stem) and the small RNA data for the leaf used in this study were retrieved from http://brassicadb.org/brad/datasets/pub/srna2/. For both the LF/MF1 and LF/MF2 sub-genome pairs, small RNA targeted TEs (RNA+ TEs) showed a negative association with the expression levels of gene doublets; that is, dominantly expressed genes from a sub-genome showed a low level of RNA+ TEs in the 2 kb of 5′ UTR regions (Supplemental Figure 9).

Comparison of gene annotation between V1.5 and V2.0
Genome synteny analysis was performed between the two versions using SynOrths (Cheng et al., 2012) to determine corresponding gene pairs (i.e., the same gene loci in the two assemblies) and tandem gene arrays. Based on the results reported by SynOrths, relationships of tandem gene arrays in the two genome versions were classified into three categories: 1) tandem arrays in one version are syntenic to tandem arrays in the other version; 2) tandem arrays in one version correspond to non-tandem genes in the other version; 3) tandem arrays in one version cannot be mapped to any gene in the other version. Non-tandem genes in the two predictions were classified into four categories: 1) one gene in V1.5 is the counterpart of one gene in V2.0 (one-to-one gene pairs); 2) one gene in V1.5 corresponds to several genes (≥2) in V2.0 (one-to-multiple genes); 3) several genes (≥2) in V1.5 correspond to one gene in V2.0 (multiple-to-one genes) (Supplemental Figure 4B); and 4) genes specific to each annotation (genes with no counterparts in the other version).
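The four non-tandem categories can be derived from a list of corresponding gene pairs. The sketch below is a hypothetical illustration that assumes the correspondences are already available as (V1.5 gene, V2.0 gene) tuples; it does not parse SynOrths output, the gene identifiers are invented, and many-to-many cases are not treated specially.

```python
from collections import defaultdict


def classify_genes(pairs, v15_genes, v20_genes):
    """Sort non-tandem genes into the four categories described above."""
    v15_to_v20 = defaultdict(set)
    v20_to_v15 = defaultdict(set)
    for old, new in pairs:
        v15_to_v20[old].add(new)
        v20_to_v15[new].add(old)

    result = {
        "one_to_one": [],
        "one_to_multiple": [],
        "multiple_to_one": [],
        "v15_specific": [],
        "v20_specific": [],
    }
    for old in sorted(v15_genes):
        partners = v15_to_v20.get(old, set())
        if not partners:
            result["v15_specific"].append(old)
        elif len(partners) >= 2:
            result["one_to_multiple"].append((old, sorted(partners)))
        else:
            (new,) = partners
            # One-to-one only if the V2.0 gene has no other V1.5 partner.
            if len(v20_to_v15[new]) == 1:
                result["one_to_one"].append((old, new))
    for new in sorted(v20_genes):
        partners = v20_to_v15.get(new, set())
        if not partners:
            result["v20_specific"].append(new)
        elif len(partners) >= 2:
            result["multiple_to_one"].append((sorted(partners), new))
    return result


# Invented identifiers, for illustration only.
toy_pairs = [
    ("old1", "new1"),
    ("old2", "new2"),
    ("old2", "new3"),
    ("old3", "new4"),
    ("old4", "new4"),
]
toy_v15 = {"old1", "old2", "old3", "old4", "old5"}
toy_v20 = {"new1", "new2", "new3", "new4", "new5"}
categories = classify_genes(toy_pairs, toy_v15, toy_v20)
```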
To further analyze the differences in genes between these two predictions, a detailed analysis was performed for each of the above four gene categories. The protein sequences of one-to-one gene pairs were compared using ClustalX (Larkin et al., 2007) to assess variations in gene coding sequences. For one-to-multiple and multiple-to-one genes, mRNA-Seq reads were mapped to the CDS sequences of these gene pairs using BWA (Li, 2013) to find evidence on whether these genes were correctly annotated. The analysis focused on read pairs in which one read mapped to one of the multiple genes while its mate mapped to another of the multiple genes; such pairs indicate that the two genes should be re-annotated as one gene. For the version-specific genes (no counterparts), the following two analyses were performed to assess their reliability: 1) mRNA-Seq reads were mapped to the CDSs of these version-specific genes using BWA; 2) their proteins were searched with BLASTP against proteins of five other Brassica species with an E-value of 1×10⁻⁵ (Supplemental Table 16). Mapping of mRNA-Seq reads to the CDSs of these genes found that 3,470 genes in V1.5 and 6,350 genes in V2.0 had mRNA-Seq evidence supporting their reliability. In addition, 3,244 genes in V1.5 and 6,687 genes in V2.0 had BLASTP hits to proteins of other Brassica species (Supplemental Table 16).
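The read-pair evidence for merging split gene models can be tallied once each mate has been assigned to a gene. The sketch below assumes that assignment has already been made from the BWA alignments; the input layout, function name, and gene names are hypothetical.

```python
from collections import Counter


def count_bridging_pairs(pair_hits):
    """Count read pairs whose two mates map to two different genes.

    pair_hits: iterable of (gene_of_read1, gene_of_read2); None = unmapped.
    """
    counts = Counter()
    for gene1, gene2 in pair_hits:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort the pair so that (A, B) and (B, A) are counted together.
        counts[tuple(sorted((gene1, gene2)))] += 1
    return counts


toy_hits = [
    ("geneA", "geneB"),
    ("geneB", "geneA"),
    ("geneA", "geneA"),
    ("geneC", None),
]
bridges = count_bridging_pairs(toy_hits)
```

Gene pairs with many bridging read pairs are candidates for re-annotation as a single gene; any threshold on that count would be a separate choice not given in the text.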

GO term enrichment analysis
For each GO term, its occurrences in the 3-copy gene set and in the combined 1- and 2-copy gene sets were counted in V2.0. Its occurrences in the 1-copy gene set and in the combined 2- and 3-copy gene sets were also counted, as reported previously (Wang et al., 2011). Fisher’s exact tests “1+2 vs 3” and “1 vs 2+3” were performed to test the over-retention of 3-copy and the under-retention of 1-copy orthologous GO terms, respectively. The results showed that genes encoding subunits of proteasomes, ribosomes, and transcription factor complexes were over-retained (Supplemental Table 17), whereas genes associated with DNA repair, binding, and chloroplast were under-retained (Supplemental Table 18). These findings are in accordance with the gene balance hypothesis, which predicts that genes whose products interact with other gene products are more likely to be over-retained, and that other genes are more likely to be under-retained (Freeling, 2009). Genes associated with response to environmental factors and plant hormones were also over-retained (Supplemental Table 17).
Using a similar method, GO enrichment of tandem gene arrays in B. rapa genome V2.0 was also analyzed. The occurrences of each GO term in the tandem arrays and in non-tandem genes were counted, and Fisher’s exact test was performed to test whether a GO term was enriched in the tandem gene arrays. The results show that genes related to defense response, membrane functions, and different kinds of enzyme activity were enriched in tandem gene arrays (Supplemental Table 9).
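For a single GO term, the enrichment test reduces to a 2×2 contingency table. The sketch below computes a one-sided Fisher’s exact p-value from the hypergeometric distribution using only the standard library. The text does not say whether the tests were one- or two-sided, so the one-sided form is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for over-representation.

    2x2 table:
        a = genes with the GO term in the set of interest
        b = genes without the GO term in the set of interest
        c = genes with the GO term in the comparison set
        d = genes without the GO term in the comparison set
    Returns P(X >= a) under the hypergeometric null.
    """
    row1 = a + b            # size of the set of interest
    col1 = a + c            # all genes carrying the GO term
    n = a + b + c + d       # all genes
    total = comb(n, row1)
    k_max = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, k_max + 1)
    )
    return tail / total
```

For example, `fisher_exact_greater(3, 1, 1, 3)` sums the probabilities of observing 3 or 4 term-carrying genes in the set of interest, giving 17/70.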

Accession numbers
This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession AENI00000000. The version described in this paper is version AENI02000000. The genome assembly and gene annotation results are also freely available through the BRAD website (Brassica database, http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/).

Supplemental Tables

Supplemental Table 1. Summary of Illumina sequencing data used in the assembly of B. rapa genome V2.0.

| Insert Size (bp) | Read Length (bp) | Raw Data (bp) | Clean Data (bp) | Sequence Depth (X) |
| --- | --- | --- | --- | --- |
| 250* | 150 | 73,219,780,500 | 55,479,375,000 | 114.39 |
| 20,000* | 44/49/90 | 18,903,314,720 | 5,638,004,562 | 11.62 |
| 40,000* | 90 | 16,801,802,820 | 3,105,338,220 | 6.4 |
| PacBio* | / | 6,473,983,398 | 6,412,500,309** | 13.22 |
| 200 | 44/75/100 | 19,264,162,762 | 13,697,925,048 | 28.24 |
| 500 | 44/75 | 7,807,414,874 | 6,407,407,132 | 13.21 |
| 2,000 | 44/75 | 3,581,696,974 | 3,322,057,634 | 6.85 |
| 5,000 | 44 | 3,212,095,304 | 2,297,055,728 | 4.74 |
| 8,000 | 44 | 2,469,275,072 | 1,255,661,264 | 2.59 |
| 10,000 | 44 | 6,106,076,328 | 4,530,229,440 | 9.34 |
| 14,000 | 44 | 2,066,000,464 | 771,807,344 | 1.59 |
| BAC-end | / | 154,631,945 | / | / |

*: newly generated in this project;
**: corrected with Illumina reads.
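The Sequence Depth column of Supplemental Table 1 can be reproduced by dividing the clean bases by the genome size. The sketch below assumes a genome size of 485 Mb, a value back-calculated here from the table's own ratios rather than stated in this section; the names are illustrative.

```python
# Assumption: ~485 Mb, inferred from Supplemental Table 1, not stated in the text.
ASSUMED_GENOME_SIZE_BP = 485_000_000


def sequence_depth(clean_bases, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Return fold coverage (X) as clean bases divided by genome size."""
    return clean_bases / genome_size


clean_data = {
    "250 bp paired-end": 55_479_375_000,
    "20 Kb mate-pair": 5_638_004_562,
    "PacBio": 6_412_500_309,
}
depths = {name: round(sequence_depth(bases), 2) for name, bases in clean_data.items()}
```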
Supplemental Table 2. Comparisons of statistics between B. rapa genome assembly V1.5 and V2.0.
V2.0 refers to the scaffolds refined with PacBio data.

| Statistic | V1.5 Contig Size (bp) | V1.5 Number of Contigs | V1.5 Scaffold Size (bp) | V1.5 Number of Scaffolds | V2.0 Contig Size (bp) | V2.0 Number of Contigs | V2.0 Scaffold Size (bp) | V2.0 Number of Scaffolds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N90 | 9,900 | 6,351 | 308,587 | 180 | 5,939 | 8,456 | 25,622 | 349 |
| N80 | 19,011 | 4,395 | 635,699 | 118 | 19,181 | 5,373 | 975,164 | 89 |
| N70 | 27,303 | 3,202 | 1,100,015 | 85 | 30,559 | 3,878 | 1,852,239 | 62 |
| N60 | 36,355 | 2,334 | 1,372,560 | 62 | 41,205 | 2,850 | 2,610,373 | 44 |
| N50 | 46,088 | 1,663 | 1,846,652 | 44 | 52,684 | 2,063 | 3,377,735 | 31 |
| Total Size | 273,100,332 | | 283,810,373 | | 366,413,862 | | 389,189,875 | |
| Total Number (>=100 bp) | | 51,647 | | 40,576 | | 96,883 | | 86,986 |
| Total Number (>=2 kb) | | 9,553 | | 821 | | 10,673 | | 2,178 |
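The N50 to N90 rows in the assembly tables appear to follow the usual Nx convention: sequences are sorted from longest to shortest, and the reported size and number are the length and count at which the running total first reaches x% of the assembly. A minimal sketch under that assumption, with a hypothetical function name and toy lengths:

```python
def nx_statistic(lengths, x=50):
    """Return (Nx length, number of sequences needed to reach x% of the total)."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must not be empty")


# Toy contig lengths (total 290 bp), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx_statistic(toy_lengths, 50)
n90 = nx_statistic(toy_lengths, 90)
```

With the toy lengths, the two longest sequences already cover half of the total, so N50 is 70 bp with a count of 2, while N90 requires five sequences.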
Supplemental Table 3. Comparisons of statistics on B. rapa genome assembly V2.0 before and after using PacBio data.

| Statistic | Contig (before using PacBio Data) Size (bp) | Contig (before using PacBio Data) Number | Scaffold (before using PacBio Data) Size (bp) | Scaffold (before using PacBio Data) Number | Scaffold Refined with PacBio Data Size (bp) | Scaffold Refined with PacBio Data Number |
| --- | --- | --- | --- | --- | --- | --- |
| N90 | 2,566 | 14,739 | 40,173 | 354 | 25,622 | 349 |
| N80 | 10,065 | 8,939 | 784,999 | 97 | 975,164 | 89 |
| N70 | 16,336 | 6,370 | 1,636,670 | 66 | 1,852,239 | 62 |
| N60 | 22,619 | 4,640 | 2,289,139 | 47 | 2,610,373 | 44 |
| N50 | 29,362 | 3,348 | 2,973,276 | 33 | 3,377,735 | 31 |
| Total Size | 333,541,035 | | 370,219,313 | | 389,189,875 | |
| Total Number (>100 bp) | | 104,563 | | 87,323 | | 86,986 |
| Total Number (>2 kb) | | 15,785 | | 1,697 | | 2,178 |
| Gaps | | | 36,678,278 | | 22,776,013 | |
Supplemental Table 4. Statistics of two genetic maps and assignment of assembly V2.0 to ten chromosomes of B. rapa with the two maps.

| Pseudo-chromosome | RILs No. of Binmarkers | RILs Position (cM) | RILs No. of Linked Scaffolds | RILs Length (Mb) | DH No. of Binmarkers | DH Position (cM) | DH No. of Linked Scaffolds | DH Length (Mb) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A01 | 196 | 162.14 | 29 | 33.7 | 133 | 128.911 | 27 | 33.3 |
| A02 | 172 | 133.611 | 15 | 30.4 | 109 | 140.507 | 14 | 30.0 |
| A03 | 228 | 178.902 | 5 | 36.4 | 111 | 170.588 | 7 | 39.3 |
| A04 | 108 | 123.855 | 14 | 23.4 | 85 | 95.46 | 14 | 23.4 |
| A05 | 214 | 117.505 | 30 | 36.0 | 150 | 153.926 | 25 | 35.5 |
| A06 | 256 | 115.225 | 25 | 39.7 | 137 | 137.036 | 18 | 37.3 |
| A07 | 149 | 123.353 | 10 | 29.7 | 117 | 148.978 | 11 | 34.8 |
| A08 | 116 | 115.619 | 15 | 27.7 | 111 | 127.699 | 14 | 27.4 |
| A09 | 264 | 136.684 | 31 | 54.4 | 143 | 188.759 | 29 | 53.9 |
| A10 | 102 | 109.837 | 6 | 18.5 | 71 | 99.652 | 9 | 21.2 |
| Total | 1,805 | 1,316.731 | 146 | 329.9 | 1,167 | 1,391.516 | 138 | 336.1 |
Supplemental Table 5. List of mis-assembled scaffolds and their splitting information.

| Chr ID | Sca ID | Sca Split ID | Start | End | Order | Supporting RILs Binmarkers |
| --- | --- | --- | --- | --- | --- | --- |
| A05 | Scaffold000001 | Scaffold000001_1 | 1 | 1568174 | + | 12 BinMarkers |
| A09 | Scaffold000001 | Scaffold000001_2 | 1568175 | 2839151 | + | 1 BinMarkers |
| A09 | Scaffold000001 | Scaffold000001_3 | 2839152 | 6711068 | - | 8 BinMarkers |
| A01 | Scaffold000001 | Scaffold000001_4 | 6711069 | 7145794 | + | 3 BinMarkers |
| A03 | Scaffold000001 | Scaffold000001_5 | 7145795 | 16627912 | + | 50 BinMarkers |
| A07 | Scaffold000003 | Scaffold000003_1 | 1 | 4706983 | + | 18 BinMarkers |
| A09 | Scaffold000003 | Scaffold000003_2 | 4706984 | 9295739 | + | 8 BinMarkers |
| A05 | Scaffold000003 | Scaffold000003_3 | 9295740 | 10101789 | - | 4 BinMarkers |
| A06 | Scaffold000003 | Scaffold000003_4 | 10101790 | 11093742 | + | |
| A10 | Scaffold000005 | Scaffold000005_1 | 1499897 | 8038702 | ? | 52 BinMarkers |
| A08 | Scaffold000005 | Scaffold000005_2 | 1 | 1499896 | ? | 4 BinMarkers |
| A01 | Scaffold000006 | Scaffold000006_1 | 1 | 879152 | + | 5 BinMarkers |
| A07 | Scaffold000006 | Scaffold000006_2 | 879153 | 4858235 | + | 18 BinMarkers |
| A02 | Scaffold000006 | Scaffold000006_3 | 4858236 | 6679097 | + | 14 BinMarkers |
| A05 | Scaffold000006 | Scaffold000006_4 | 6679123 | 7844833 | + | |
| A01 | Scaffold000007 | Scaffold000007_1 | 1 | 2257556 | + | 5 BinMarkers |
| A09 | Scaffold000007 | Scaffold000007_2 | 2257557 | 5817461 | - | 9 BinMarkers |
| A08 | Scaffold000007 | Scaffold000007_3 | 5817462 | 7741191 | ? | 1 BinMarkers |
| A06 | Scaffold000008 | Scaffold000008_1 | 1 | 784055 | - | 5 BinMarkers |
| A03 | Scaffold000008 | Scaffold000008_2 | 784056 | 6702624 | - | 19 BinMarkers |
| A06 | Scaffold000008 | Scaffold000008_3 | 6702625 | 7319999 | + | 5 BinMarkers |
| A05 | Scaffold000009 | Scaffold000009_1 | 5502703 | 7126136 | ? | 14 BinMarkers |
| A02 | Scaffold000009 | Scaffold000009_2 | 1801030 | 5502702 | ? | 20 BinMarkers |
| A09 | Scaffold000009 | Scaffold000009_3 | 1 | 1801029 | ? | 5 BinMarkers |
| A07 | Scaffold000010 | Scaffold000010_1 | 1 | 2672674 | + | 17 BinMarkers |
| A01 | Scaffold000010 | Scaffold000010_2 | 2672675 | 6707728 | + | 11 BinMarkers |
| A02 | Scaffold000012 | Scaffold000012_1 | 3977006 | 6361331 | ? | 18 BinMarkers |
| A09 | Scaffold000012 | Scaffold000012_2 | 1 | 3977005 | ? | 10 BinMarkers |
| A04 | Scaffold000014 | Scaffold000014_1 | 1 | 1582907 | + | 2 BinMarkers |
| A05 | Scaffold000014 | Scaffold000014_2 | 1582908 | 3506367 | + | 9 BinMarkers |
| A06 | Scaffold000014 | Scaffold000014_3 | 3506368 | 5718963 | + | 13 BinMarkers |
| A09 | Scaffold000016 | Scaffold000016_1 | 1 | 4330173 | - | 6 BinMarkers |
| A06 | Scaffold000016 | Scaffold000016_2 | 4330174 | 5621156 | + | 5 BinMarkers |
| A01 | Scaffold000022 | Scaffold000022_1 | 3811411 | 4302067 | ? | 3 BinMarkers |
| A08 | Scaffold000022 | Scaffold000022_2 | 1 | 3811410 | ? | 11 BinMarkers |
| A07 | Scaffold000029 | Scaffold000029_1 | 2332209 | 4120263 | + | 4 BinMarkers |
| A02 | Scaffold000029 | Scaffold000029_2 | 1 | 2332208 | + | 3 BinMarkers |
| A01 | Scaffold000030 | Scaffold000030_1 | 1 | 1708633 | - | 8 BinMarkers |
| A02 | Scaffold000030 | Scaffold000030_2 | 1708634 | 3289170 | - | 8 BinMarkers |
| A05 | Scaffold000031 | Scaffold000031_1 | 1 | 932619 | + | 8 BinMarkers |
| A08 | Scaffold000031 | Scaffold000031_2 | 932620 | 3246927 | - | 17 BinMarkers |
| A06 | Scaffold000034 | Scaffold000034_1 | 1 | 3966651 | + | 2 BinMarkers |
| A05 | Scaffold000034 | Scaffold000034_2 | 3966652 | 4925386 | + | 7 BinMarkers |
| A04 | Scaffold000040 | Scaffold000040_1 | 1 | 1017356 | + | 2 BinMarkers |
| A08 | Scaffold000040 | Scaffold000040_2 | 1017357 | 2812160 | ? | 1 BinMarkers |
| A07 | Scaffold000042 | Scaffold000042_2 | 1249768 | 2610373 | - | 2 BinMarkers |
| A06 | Scaffold000047 | Scaffold000047_1 | 1 | 1339929 | - | 10 BinMarkers |
| A09 | Scaffold000047 | Scaffold000047_2 | 1339930 | 2395810 | - | 5 BinMarkers |
| A05 | Scaffold000048 | Scaffold000048_1 | 1 | 2314489 | - | 14 BinMarkers |
| A04 | Scaffold000048 | Scaffold000048_2 | 2314490 | 2394815 | ? | 1 BinMarkers |
| A05 | Scaffold000051 | Scaffold000051_1 | 1 | 194542 | - | 2 BinMarkers |
| A04 | Scaffold000051 | Scaffold000051_2 | 194543 | 2129242 | + | 12 BinMarkers |
| A05 | Scaffold000059 | Scaffold000059_1 | 1 | 1942021 | + | |
| A07 | Scaffold000060 | Scaffold000060_1 | 472477 | 1852239 | ? | 6 BinMarkers |
| A09 | Scaffold000060 | Scaffold000060_2 | 1 | 472476 | ? | 3 BinMarkers |
| A01 | Scaffold000064 | Scaffold000064_1 | 1 | 413784 | - | |
| A05 | Scaffold000064 | Scaffold000064_2 | 413785 | 1779331 | + | 5 BinMarkers |
| A08 | Scaffold000069 | Scaffold000069_1 | 1 | 994093 | - | 8 BinMarkers |
| A01 | Scaffold000069 | Scaffold000069_2 | 994094 | 1677521 | - | 4 BinMarkers |
| A05 | Scaffold000091 | Scaffold000091_1 | 1 | 370869 | - | |
| A06 | Scaffold000091 | Scaffold000091_2 | 370870 | 1095044 | - | |
| A05 | Scaffold000103 | Scaffold000103_1 | 469247 | 824434 | ? | 2 BinMarkers |
| A09 | Scaffold000103 | Scaffold000103_2 | 1 | 469246 | ? | |
| A01 | Scaffold000108 | Scaffold000108_1 | 1 | 452712 | - | 4 BinMarkers |
| A10 | Scaffold000108 | Scaffold000108_2 | 452713 | 661024 | - | |
Supplemental Table 6. Detailed information about the order and orientation of scaffolds on ten chromosomes of B. rapa.
Group
ID
Chr
ID
Splited Sca
Order
Sca
Order Start End Order
Split
Sca
Z16XL144DH
Linkage Map
RILs
Linkage Map
G01 A01 Scaffold000019 Scaffold000019 1 4924075 -
support
G01 A01 Scaffold000022_1 Scaffold000022 3811411 4302067 ? Split support 3 BinMarkers
G01 A01 Scaffold001067 Scaffold001067 1 6655 ?
no support
G01 A01 Scaffold000081 Scaffold000081 1 1248719 -
support
G01 A01 Scaffold000025 Scaffold000025 1 3737062 +
support
G01 A01 Scaffold000108_1 Scaffold000108 1 452712 - Split support 4 BinMarkers
G01 A01 Scaffold000077 Scaffold000077 1 1442098 -
support
G01 A01 Scaffold000084 Scaffold000084 1 1146124 +
support
G01 A01 Scaffold000007_1 Scaffold000007 1 2257556 + Split support 5 BinMarkers
G01 A01 Scaffold000110 Scaffold000110 1 625704 -
support
G01 A01 Scaffold000058 Scaffold000058 1 1903670 +
support
G01 A01 Scaffold000030_1 Scaffold000030 1 1708633 - Split support 8 BinMarkers
G01 A01 Scaffold000064_1 Scaffold000064 1 413784 - Split no support 4 BinMarkers
G01 A01 Scaffold000069_2 Scaffold000069 994094 1677521 - Split support 4 BinMarkers
G01 A01 Scaffold000079 Scaffold000079 1 1429233 +
support
G01 A01 Scaffold000010_2 Scaffold000010 2672675 6707728 + Split support 11 BinMarkers
G01 A01 Scaffold000001_4 Scaffold000001 6711069 7145794 + Split support 3 BinMarkers
G01 A01 Scaffold000006_1 Scaffold000006 1 879152 + Split support 5 BinMarkers
G01 A01 Scaffold000080 Scaffold000080 1 1365531 +
support
G01 A01 Scaffold000097 Scaffold000097 1 839554 +
support
19
G01 A01 Scaffold000128 Scaffold000128 1 404483 -
support
G01 A01 Scaffold000116 Scaffold000116 1 540442 -
support
G01 A01 Scaffold000135 Scaffold000135 1 311370 -
support
G01 A01 Scaffold000104 Scaffold000104 1 785330 -
support
G01 A01 Scaffold000138 Scaffold000138 1 312602 +
support
G01 A01 Scaffold000155 Scaffold000155 1 209444 +
support
G01 A01 Scaffold000147 Scaffold000147 1 227696 -
support
G01 A01 Scaffold000151 Scaffold000151 1 191664 +
support
G01 A01 Scaffold000106 Scaffold000106 1 738834 +
support
G02 A02 Scaffold000068 Scaffold000068 1 1623172 -
support
G02 A02 Scaffold000055 Scaffold000055 1 2086791 +
support
G02 A02 Scaffold000113 Scaffold000113 1 565333 -
support
G02 A02 Scaffold000044 Scaffold000044 1 3377735 +
support
G02 A02 Scaffold000041 Scaffold000041 1 2704147 +
support
G02 A02 Scaffold000123 Scaffold000123 1 406925 +
support
G02 A02 Scaffold000009_2 Scaffold000009 1801030 5502702 ? Split support 20 BinMarkers
G02 A02 Scaffold000062 Scaffold000062 1 1818794 +
support
G02 A02 Scaffold000012_1 Scaffold000012 3977006 6361331 ? Split support 18 BinMarkers
G02 A02 Scaffold000083 Scaffold000083 1 1181007 -
support
G02 A02 Scaffold000006_3 Scaffold000006 4858236 6679097 + Split support 14 BinMarkers
G02 A02 Scaffold000130 Scaffold000130 1 415228 -
no support
G02 A02 Scaffold000029_2 Scaffold000029 1 2332208 + Split support 3 BinMarkers
G02 A02 Scaffold000030_2 Scaffold000030 1708634 3289170 - Split support 8 BinMarkers
G02 A02 Scaffold000021 Scaffold000021 1 4367232 -
support
G03 A07 Scaffold000072 Scaffold000072 1 1602627 -
support
G03 A07 Scaffold000060_1 Scaffold000060 472477 1852239 ? Split support 6 BinMarkers
20
G03 A07 Scaffold000029_1 Scaffold000029 2332209 4120263 + Split support 4 BinMarkers
G03 A07 Scaffold000003_1 Scaffold000003 1 4706983 + Split support 18 BinMarkers
G03 A07 Scaffold000045 Scaffold000045 1 2590264 -
support
G03 A07 Scaffold000006_2 Scaffold000006 879153 4858235 + Split support 18 BinMarkers
G03 A07 Scaffold000042_2 Scaffold000042 1249768 2610373 - Split support 2 BinMarkers
G03 A07 Scaffold000010_1 Scaffold000010 1 2672674 + Split support 17 BinMarkers
G03 A07 Scaffold000017 Scaffold000017 1 5123980 -
support
G03 A07 Scaffold000020 Scaffold000020 1 4515445 -
support
G04 A09 Scaffold000011 Scaffold000011 1 6530455 -
support
G04 A09 Scaffold000060_2 Scaffold000060 1 472476 ? Split support 3 BinMarkers
G04 A09 Scaffold000153 Scaffold000153 1 199367 -
no support
G04 A09 Scaffold000085 Scaffold000085 1 1138684 +
support
G04 A09 Scaffold000131 Scaffold000131 1 386326 +
support
G04 A09 Scaffold000105 Scaffold000105 1 763457 +
support
G04 A09 Scaffold000063 Scaffold000063 1 1763648 -
support
G04 A09 Scaffold000125 Scaffold000125 1 413779 -
support
G04 A09 Scaffold000036 Scaffold000036 1 2972422 +
support
G04 A09 Scaffold000001_3 Scaffold000001 2839152 6711068 - Split support 8 BinMarkers
G04 A09 Scaffold000109 Scaffold000109 1 644992 +
support
G04 A09 Scaffold000506 Scaffold000506 1 21420 ?
no support
G04 A09 Scaffold000033 Scaffold000033 1 3178030 -
support
G04 A09 Scaffold000012_2 Scaffold000012 1 3977005 ? Split support 10 BinMarkers
G04 A09 Scaffold000001_2 Scaffold000001 1568175 2839151 + Split support 1 BinMarkers
G04 A09 Scaffold000016_1 Scaffold000016 1 4330173 - Split support 6 BinMarkers
G04 A09 Scaffold000073 Scaffold000073 1 1599488 +
support
G04 A09 Scaffold000003_2 Scaffold000003 4706984 9295739 + Split support 8 BinMarkers
21
G04 A09 Scaffold000092 Scaffold000092 1 920634 - support
G04 A09 Scaffold000103_2 Scaffold000103 1 469246 ? Split no support
G04 A09 Scaffold000112 Scaffold000112 1 611878 - support
G04 A09 Scaffold000094 Scaffold000094 1 882339 + support
G04 A09 Scaffold000009_3 Scaffold000009 1 1801029 ? Split support 5 BinMarkers
G04 A09 Scaffold000061 Scaffold000061 1 1921111 + support
G04 A09 Scaffold000007_2 Scaffold000007 2257557 5817461 - Split support 9 BinMarkers
G04 A09 Scaffold000621 Scaffold000621 1 13908 ? no support
G04 A09 Scaffold000047_2 Scaffold000047 1339930 2395810 - Split support 5 BinMarkers
G04 A09 Scaffold000050 Scaffold000050 1 2386279 - support
G04 A09 Scaffold000533 Scaffold000533 1 13461 ? no support
G04 A09 Scaffold000493 Scaffold000493 1 63376 ? no support
G04 A09 Scaffold000071 Scaffold000071 1 2554287 - support
G04 A09 Scaffold000505 Scaffold000505 1 15192 ? no support
G05 A05 Scaffold000103_1 Scaffold000103 469247 824434 ? Split support 2 BinMarkers
G05 A05 Scaffold000120 Scaffold000120 1 482396 ? support
G05 A05 Scaffold000152 Scaffold000152 1 186206 ? no support
G05 A05 Scaffold000091_1 Scaffold000091 1 370869 - Split no support 2 BinMarkers
G05 A05 Scaffold000014_2 Scaffold000014 1582908 3506367 + Split support 9 BinMarkers
G05 A05 Scaffold000048_1 Scaffold000048 1 2314489 - Split support 14 BinMarkers
G05 A05 Scaffold000051_1 Scaffold000051 1 194542 - Split support 2 BinMarkers
G05 A05 Scaffold000031_1 Scaffold000031 1 932619 + Split support 8 BinMarkers
G05 A05 Scaffold000001_1 Scaffold000001 1 1568174 + Split support 12 BinMarkers
G05 A05 Scaffold000059_1 Scaffold000059 1 1942021 + support
G05 A05 Scaffold000034_2 Scaffold000034 3966652 4925386 + Split support 7 BinMarkers
G05 A05 Scaffold000064_2 Scaffold000064 413785 1779331 + Split support 5 BinMarkers
G05 A05 Scaffold000511 Scaffold000511 1 23219 ? no support
G05 A05 Scaffold000042 Scaffold000042 1 2610373 + Split support 7 BinMarkers
G05 A05 Scaffold000133 Scaffold000133 1 369630 ? support
G05 A05 Scaffold000076 Scaffold000076 1 1504109 + support
G05 A05 Scaffold000006_4 Scaffold000006 6679123 7844833 + Split no support 8 BinMarkers
G05 A05 Scaffold000038 Scaffold000038 1 2985750 + support
G05 A05 Scaffold000132 Scaffold000132 1 382277 + support
G05 A05 Scaffold000003_3 Scaffold000003 9295740 10101789 - Split support 4 BinMarkers
G05 A05 Scaffold000143 Scaffold000143 1 260658 + no support
G05 A05 Scaffold000009_1 Scaffold000009 5502703 7126136 ? Split support 14 BinMarkers
G05 A05 Scaffold000024 Scaffold000024 1 4645629 - support
G05 A05 Scaffold000129 Scaffold000129 1 400740 + support
G05 A05 Scaffold000160 Scaffold000160 1 174875 + no support
G05 A05 Scaffold000119 Scaffold000119 1 494856 - no support
G05 A05 Scaffold000183 Scaffold000183 1 90881 ? no support
G05 A05 Scaffold000127 Scaffold000127 1 399023 - support
G05 A05 Scaffold000027 Scaffold000027 1 3437377 - support
G05 A05 Scaffold000056 Scaffold000056 1 2001222 - support
G06 A10 Scaffold000005_1 Scaffold000005 1499897 8038702 ? Split support 52 BinMarkers
G06 A10 Scaffold000015 Scaffold000015 1 5489197 + support 27 BinMarkers
G06 A10 Scaffold000108_2 Scaffold000108 452713 661024 - Split no support 2 BinMarkers
G06 A10 Scaffold000075 Scaffold000075 1 1477010 - support 2 BinMarkers
G06 A10 Scaffold000057 Scaffold000057 1 1972711 - support 9 BinMarkers
G06 A10 Scaffold000037 Scaffold000037 1 2850418 + support 10 BinMarkers
G07 A03 Scaffold000013 Scaffold000013 1 6133134 + support 44 BinMarkers
G07 A03 Scaffold000070 Scaffold000070 1 1618636 + support 13 BinMarkers
G07 A03 Scaffold000002 Scaffold000002 1 13282552 + support 102 BinMarkers
G07 A03 Scaffold000001_5 Scaffold000001 7145795 16627912 + Split support 50 BinMarkers
G07 A03 Scaffold000008_2 Scaffold000008 784056 6702624 - Split support 19 BinMarkers
G08 A06 Scaffold000393 Scaffold000393 1 39066 ? no support
G08 A06 Scaffold001083 Scaffold001083 1 3324 ? no support
G08 A06 Scaffold000221 Scaffold000221 1 80530 ? no support
G08 A06 Scaffold000595 Scaffold000595 1 10337 ? no support
G08 A06 Scaffold000145 Scaffold000145 1 261692 - no support
G08 A06 Scaffold000003_4 Scaffold000003 10101790 11093742 + Split no support 4 BinMarkers
G08 A06 Scaffold000014_3 Scaffold000014 3506368 5718963 + Split support 13 BinMarkers
G08 A06 Scaffold000016_2 Scaffold000016 4330174 5621156 + Split support 5 BinMarkers
G08 A06 Scaffold000177 Scaffold000177 1 147072 ? support
G08 A06 Scaffold000091_2 Scaffold000091 370870 1095044 - Split no support 5 BinMarkers
G08 A06 Scaffold000609 Scaffold000609 1 10123 ? no support
G08 A06 Scaffold000136 Scaffold000136 1 324742 + no support
G08 A06 Scaffold000034_1 Scaffold000034 1 3966651 + Split support 2 BinMarkers
G08 A06 Scaffold000139 Scaffold000139 1 305217 - support
G08 A06 Scaffold000047_1 Scaffold000047 1 1339929 - Split support 10 BinMarkers
G08 A06 Scaffold000008_1 Scaffold000008 1 784055 - Split support 5 BinMarkers
G08 A06 Scaffold000035 Scaffold000035 1 3018247 + support
G08 A06 Scaffold000087_con Scaffold000087 1 1117089 - support
G08 A06 Scaffold000039 Scaffold000039 1 2792378 - support
G08 A06 Scaffold000008_3 Scaffold000008 6702625 7319999 + Split support 5 BinMarkers
G08 A06 Scaffold000023 Scaffold000023 1 4237132 - support
G08 A06 Scaffold000118 Scaffold000118 1 527563 + support
G08 A06 Scaffold000004 Scaffold000004 1 9448172 + support
G08 A06 Scaffold000052 Scaffold000052 1 2161117 + support
G08 A06 Scaffold000053 Scaffold000053 1 2145108 + support
G08 A06 Scaffold000082 Scaffold000082 1 1179777 - support
G09 A08 Scaffold000140 Scaffold000140 1 286569 - support
G09 A08 Scaffold000007_3 Scaffold000007 5817462 7741191 ? Split support 1 BinMarkers
G09 A08 Scaffold000069_1 Scaffold000069 1 994093 - Split support 8 BinMarkers
G09 A08 Scaffold000022_2 Scaffold000022 1 3811410 ? Split support 11 BinMarkers
G09 A08 Scaffold000005_2 Scaffold000005 1 1499896 ? Split support 4 BinMarkers
G09 A08 Scaffold000098 Scaffold000098 1 827123 + support
G09 A08 Scaffold000046 Scaffold000046 1 2618469 - support
G09 A08 Scaffold000054 Scaffold000054 1 2092286 + support
G09 A08 Scaffold000146 Scaffold000146 1 242819 ? no support
G09 A08 Scaffold000074 Scaffold000074 1 1513914 + support
G09 A08 Scaffold000026 Scaffold000026 1 3517859 - support
G09 A08 Scaffold000040_2 Scaffold000040 1017357 2812160 ? Split support 1 BinMarkers
G09 A08 Scaffold000067 Scaffold000067 1 1623910 + support
G09 A08 Scaffold000031_2 Scaffold000031 932620 3246927 - Split support 17 BinMarkers
G09 A08 Scaffold000043 Scaffold000043 1 2595475 + support
G10 A04 Scaffold000018 Scaffold000018 1 5044191 - support
G10 A04 Scaffold000014_1 Scaffold000014 1 1582907 + Split support 2 BinMarkers
G10 A04 Scaffold000032 Scaffold000032 1 3167877 - support
G10 A04 Scaffold000099 Scaffold000099 1 838119 + support
G10 A04 Scaffold000040_1 Scaffold000040 1 1017356 + Split support 2 BinMarkers
G10 A04 Scaffold000028 Scaffold000028 1 5022525 + support
G10 A04 Scaffold000049 Scaffold000049 1 2365081 - support
G10 A04 Scaffold000051_2 Scaffold000051 194543 2129242 + Split support 12 BinMarkers
G10 A04 Scaffold000090 Scaffold000090 1 975164 - support
G10 A04 Scaffold000161 Scaffold000161 1 145531 ? support
G10 A04 Scaffold000111 Scaffold000111 1 588963 + support
G10 A04 Scaffold000048_2 Scaffold000048 2314490 2394815 ? Split support 1 BinMarkers
G10 A04 Scaffold000117 Scaffold000117 1 525733 + support
G10 A04 Scaffold000176 Scaffold000176 1 114162 ? support
Supplemental Table 7. Comparisons of statistics on genomic annotations between B.
rapa genome V1.5 and V2.0
Genome                                  assemVer 1.5    assemVer 2.0
Size (bp)                               283,810,373     389,189,875
GC Content                              35.26%          36.17%

Genes                                   annotVer 1.5    annotVer 2.0
Number of Genes                         41,174          48,826
Number of Genes on Plus Strand          20,608          24,684
Number of Genes on Minus Strand         20,566          24,142
Multi-exon Genes                        32,240          36,850
Mean Gene Length (bp)                   2,015           1,908
Gene density (Kb/gene)                  6.9             8.0
Number of Transcripts                   41,174          55,959
Percent of Transcripts with Introns     78.30%          78.04%
Mean Transcript Length (bp)             2,015           2,348
Mean CDS Length                         1,172           1,100
Percent Coding                          17.00%          13.80%

Exons                                   annotVer 1.5    annotVer 2.0
Number                                  206,990         237,462
Mean Number per Transcript              5.03            4.86
Mean Length (bp)                        233             226
Total Length (bp)                       48,237,786      53,705,886

Introns                                 annotVer 1.5    annotVer 2.0
Number                                  165,816         188,636
Mean Number per Transcript              4.03            3.86
Mean Length (bp)                        209             209
Total Length (bp)                       34,732,551      39,443,842

UTRs                                    annotVer 1.5    annotVer 2.0
Number of Genes Having UTRs             NA              29,423
Mean UTR Length (bp)                    NA              230.23
Number of 5′ UTRs                       NA              40,854
Mean 5′ UTR Length (bp)                 NA              178.95
Number of 3′ UTRs                       NA              35,836
Mean 3′ UTR Length (bp)                 NA              288.69
Supplementary Table 8. Classification of tandem gene arrays of two versions based
on their syntenic relationship.
Item V1.5 V2.0
Tandem-to-Tandem (a) 1,458 1,517
Tandem-to-NonTandem (b) 453 1,636
Tandem-to-NonSynteny (c) 253 372
Total 2,164 3,532
a: counterparts of tandem arrays in V1.5 (or V2.0) are also tandem arrays in V2.0 (or
V1.5)
b: tandem arrays in V1.5 (or V2.0) correspond to non-tandem arrays in V2.0 (or V1.5)
c: tandem arrays in V1.5 (or V2.0) have no syntenic counterparts in V2.0 (or V1.5)
Supplementary Table 9. The enriched GO terms among tandem arrays in B. rapa
genome V2.0. (Excel table)
Supplemental Table 10. Comparisons of the two annotations based on their syntenic
relationship.
Item V1.5 V2.0
Total genes 41,174 48,826
Tandem (#arrays|#genes) 2,164|5,228 3,525|7,977
Tandem Redundancy Removed 38,110 44,374
Non-syntenic Genes 3,834 8,630
Syntenic Genes 34,276 35,744
One-to-one Genes 29,294 29,294
Multiple-to-one Genes 2,235 1,076
One-to-multiple Genes 2,327 4,972
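
Several rows of this table appear to be linked by simple arithmetic, which can be checked directly. The sketch below rests on two assumptions that the table itself does not state: that "Tandem Redundancy Removed" keeps one representative gene per tandem array, and that syntenic and non-syntenic genes together make up that reduced set. The dictionary keys and the function name are illustrative only.

```python
# Values copied from Supplemental Table 10; key names are illustrative.
table10 = {
    "V1.5": {"total": 41174, "arrays": 2164, "tandem_genes": 5228,
             "redundancy_removed": 38110, "non_syntenic": 3834, "syntenic": 34276},
    "V2.0": {"total": 48826, "arrays": 3525, "tandem_genes": 7977,
             "redundancy_removed": 44374, "non_syntenic": 8630, "syntenic": 35744},
}


def collapse_tandems(total, arrays, tandem_genes):
    """Gene count after keeping one gene per tandem array (assumed rule)."""
    return total - (tandem_genes - arrays)


for version, row in table10.items():
    collapsed = collapse_tandems(row["total"], row["arrays"], row["tandem_genes"])
    partition = row["syntenic"] + row["non_syntenic"]
    # Both comparisons hold for the values in the table.
    print(version, collapsed == row["redundancy_removed"],
          partition == row["redundancy_removed"])
```

Under these assumptions both checks pass for V1.5 and V2.0, which supports this reading of the rows without confirming that it is how the authors derived them.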
Supplementary Table 11. Statistics of mRNA-Seq data.
Tissue Total Raw Data (Gb)
Above-ground Stem 3.15
Flower 4.06
Small Anther 3.77
Middle Anther 3.32
Tender Leaf 4.7
Middle Leaf 3.32
Petiole 3.48
Seed Pod 3.26
Supplementary Table 12. Assessment of V1.5 and V2.0 genesets in BUSCO
notation.
Version Gene number BUSCO notation assessment results
V1.5 41,174 C:93.7% [D:67%], F:2.1%, M:4.2%, n:429*
V2.0 48,826 C:93% [D:67%], F:2.8%, M:4.2%, n:429
* C:complete [D:duplicated], F:fragmented, M:missing, n:gene number.
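
For readers who want to compare the two rows programmatically, the BUSCO notation can be split into its fields with a small parser. This is a minimal sketch; the function name `parse_busco` is hypothetical and is not part of the BUSCO software.

```python
import re


def parse_busco(notation):
    """Parse a BUSCO summary string into {field: value}.

    Percent fields (C, D, F, M) become floats; n becomes an int.
    """
    fields = {}
    for key, value in re.findall(r"([CDFMn]):\s*([\d.]+)", notation):
        fields[key] = int(value) if key == "n" else float(value)
    return fields


v15 = parse_busco("C:93.7% [D:67%], F:2.1%, M:4.2%, n:429")
v20 = parse_busco("C:93% [D:67%], F:2.8%, M:4.2%, n:429")

# Complete, fragmented and missing fractions should account for all n genes.
for name, result in (("V1.5", v15), ("V2.0", v20)):
    print(name, result, round(result["C"] + result["F"] + result["M"], 1))
```

In both versions the C, F and M percentages sum to 100, as expected for a partition of the 429 BUSCO genes.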
Supplemental Table 13. Statistics of non-coding RNAs.
Type    Subtype     Copy    Average Length (bp)   Total Length (bp)   % Genome
miRNA               1,295   211                   273,763             0.070
tRNA                1,391   75                    104,488             0.027
rRNA                1,730   299                   517,909             0.133
        18S         677     499                   337,911             0.087
        28S         573     120                   68,828              0.018
        5.8S        244     335                   81,737              0.021
        5S          236     125                   29,433              0.007
snRNA               3,511   84                    295,668             0.076
        CD-box      3,201   79                    253,329             0.065
        HACA-box    130     120                   15,639              0.004
        splicing    180     148                   26,700              0.007
Supplementary Table 14. Detailed sub-genome information of B. rapa genome V2.0.
(Excel table)
Supplementary Table 15. Dominant gene expression between sub-genomes LF and
MFs in B. rapa genome V2.0.
Organ   #2-fold changes*          Not expressed   Binomial test (LF & MFs)
        LF      MF1     MF2
leaf    300     178     154       87              2.61E-13
stem    280     183     154       71              6.29E-10
root    264     197     165       48              4.40E-06
*: number of genes expressed at least two-fold higher than both of the other two
syntenic genes.
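
The table does not state the null hypothesis behind the binomial test, so the following is only a sketch of how such a test can be computed, not a reproduction of the reported p-values. It assumes a null in which each of the three sub-genomes is equally likely to carry the dominant copy (p = 1/3) and uses a one-sided upper tail for LF; both choices, and the function name, are assumptions made for illustration.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total


# Dominant-gene counts (LF, MF1, MF2) copied from Supplementary Table 15.
counts = {"leaf": (300, 178, 154), "stem": (280, 183, 154), "root": (264, 197, 165)}

for organ, (lf, mf1, mf2) in counts.items():
    n = lf + mf1 + mf2
    # Assumed null: LF is dominant in one third of the triplets.
    print(organ, n, binom_upper_tail(lf, n, 1.0 / 3.0))
```

A different null (for example LF against the pooled MF counts) or a two-sided test would give different values, so the output should be read as an illustration of the method only.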
Supplementary Table 16. Sequence homology between B. rapa version-specific
genes and other Brassicaceae species.
Species       V1.5                                     V2.0
              hits    identity>=70%, coverage>=70%     hits    identity>=70%, coverage>=70%
A. thaliana   2,278   972                              4,631   1,274
A. lyrata     2,300   945                              4,459   1,257
B. oleracea   3,194   1,693                            6,512   2,541
C. rubella    2,223   883                              4,602   1,226
T. parvula    2,293   976                              4,910   1,345
Total         3,244   1,758                            6,687   2,642
Supplementary Table 17. Enriched GO terms of over-retained genes in B. rapa
genome V2.0. (Excel table)
Supplementary Table 18. Enriched GO terms of under-retained genes in B. rapa
genome V2.0. (Excel table)
Supplemental Figures
Supplemental Figure 1. An example of integrating information from the linkage maps
and the synteny maps between B. rapa and (A) A. thaliana or (B) S. parvula to determine
the order of scaffolds in a local region of chromosome A04. Red plots represent the
genetic map of the RIL population, and green plots represent the genetic map of the
F1DH population.
Supplemental Figure 2. Scatterplot showing the correspondence between physical
position and genetic distance for the RIL (blue) and F1DH (red) populations on the ten
chromosomes of B. rapa.
Supplemental Figure 3. Distribution of total tandem genes and tandem arrays along
chromosomes in B. rapa genome V1.5 and V2.0. A 6-Mb sliding window with a 1-Mb
step was applied to screen tandem genes and tandem arrays across ten chromosomes.
The y-axis denotes the number of tandem genes and tandem arrays in each window.
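
The sliding-window count described in this caption can be sketched as follows. The 6-Mb window and 1-Mb step come from the caption; everything else is an assumption made for illustration, including the function name, the use of a single coordinate per gene, and the choice to keep truncated windows at the chromosome end.

```python
from bisect import bisect_left


def sliding_window_counts(positions, chrom_length, window=6_000_000, step=1_000_000):
    """Count positions falling in [start, start + window) for each window start.

    positions: gene coordinates in bp (for example gene start positions).
    Windows that run past the chromosome end are kept and simply hold fewer genes.
    """
    ordered = sorted(positions)
    results = []
    start = 0
    while start < chrom_length:
        end = start + window
        # Number of sorted positions p with start <= p < end.
        count = bisect_left(ordered, end) - bisect_left(ordered, start)
        results.append((start, count))
        start += step
    return results


# Toy example: three genes on an 8-Mb chromosome.
example = sliding_window_counts([500_000, 1_500_000, 6_500_000], 8_000_000)
for start, count in example:
    print(start, count)
```

Sorting once and using binary search keeps each window lookup logarithmic in the number of genes, which matters little at this scale but avoids rescanning the gene list for every window.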
Supplemental Figure 4. Comparisons of gene annotations in the two assemblies of the B.
rapa genome. (A) Classification of corresponding gene-pairs in the two gene
annotations. (B) Illustration of the classification of gene-pairs between the two
assemblies, namely one-to-one gene-pairs, one-to-multiple gene arrays, and
multiple-to-one gene-pairs. (C) Examples of sequence alignments of one-to-one
gene-pairs showing the variations between the two genesets. The first shows only a point
mutation between the two protein sequences; the second shows a gene annotated with
more coding sequence in V2.0 than in V1.5; and the third shows a gene annotated with
more coding sequence in V1.5 than in V2.0. The fourth example shows a combination
of the cases above.
Supplemental Figure 5. Comparisons of TEs between B. rapa genome V1.5 and
V2.0. (A) Frequency distributions of TE subgroups in the two assemblies of the B. rapa
genome. (B) Number and frequency distributions of TEs of different lengths in the
two assemblies. The x-axis shows TE length, and the y-axis denotes the
TE count (left) and percentage (right).
Supplemental Figure 6. Distribution of TEs along ten chromosomes in B. rapa
genome V1.5 and V2.0. A 500-kb sliding window with a 100-kb step was applied to
screen TEs across the ten chromosomes. The y-axis denotes the ratio of TE sequences
in each window.
Supplemental Figure 7. Examples showing that more TEs were annotated in
corresponding intergenic regions of the two genome assemblies of B. rapa. Blue
triangles represent TEs located in the intergenic regions. Red arrows link
corresponding gene pairs that show syntenic relationships. (A) A new TE in V2.0
shows limited similarity (coverage=71.9%) to the intergenic region between genes
Bra000493 and Bra000494. (B) Three more TEs were annotated between
BraA04001951 and BraA04001952 than between Bra034310 and Bra034311.
Supplemental Figure 8. The density of orthologous genes in the three sub-genomes (LF,
MF1 and MF2) of B. rapa genome V2.0 compared to A. thaliana. The x-axis denotes
the seven ancestral chromosomes of the Brassica ancestral genome. The y-axis denotes the
percentage of retained orthologous genes in B. rapa sub-genomes around each A.
thaliana gene, with a total window size of 1,001 genes (500 genes flanking each side
of a given gene). Part of the D block was lost in sub-genome LF, resulting in the
decreased gene density at the head of tPCK7.
Supplemental Figure 9. RNA+ TEs show a negative association with expression levels
of gene doublets between sub-genomes LF and MF1, and LF and MF2. (A) Dominantly
expressed genes from sub-genome LF show a low level of RNA+ TEs in the 2 kb of 5′
UTR regions. dLF denotes dominantly expressed genes from LF. (B) Dominantly
expressed genes from sub-genome MF1 show a low level of RNA+ TEs in the 2 kb of
5′ UTR regions. dMF1 denotes dominantly expressed genes from MF1. 1,046 and 665
gene doublets of dLF-MF1 and LF-dMF1 were considered, respectively. (C)
Dominantly expressed genes from sub-genome LF show a low level of RNA+ TEs in
the 2 kb of 5′ UTR regions. (D) Dominantly expressed genes from sub-genome MF2
show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. dMF2 denotes
dominantly expressed genes from MF2. 878 and 533 gene doublets of dLF-MF2 and
LF-dMF2 were considered, respectively.
Supplemental References
Birney, E., Clamp, M., and Durbin, R. (2004). GeneWise and genomewise. Genome research
14:988-995.
Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled
contigs using SSPACE. Bioinformatics 27:578-579.
Cheng, F., Sun, C., Wu, J., Schnable, J., Woodhouse, M.R., Liang, J., Cai, C., Freeling, M., and Wang, X.
(2016). Epigenetic regulation of subgenome dominance following whole genome triplication
in Brassica rapa. The New phytologist 211:288-299.
Cheng, F., Wu, J., Fang, L., and Wang, X. (2012). Syntenic gene analysis between Brassica rapa and
other Brassicaceae species. The Brassica Genome:5.
English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., and Worley,
K.C. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read
sequencing technology. PloS one 7:e47768.
Feng, C., Jian, W., Lu, F., Silong, S., Bo, L., Ke, L., Guusje, B., and Xiaowu, W. (2012). Biased Gene
Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa. PloS
one 7:e36442.
Freeling, M. (2009). Bias in plant gene content following different sorts of duplication: tandem,
whole-genome, segmental, or by transposition. Annual review of plant biology 60:433-453.
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L.,
Raychowdhury, R., and Zeng, Q. (2011). Trinity: reconstructing a full-length transcriptome
without a genome from RNA-Seq data. Nature biotechnology 29:644.
Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., and Bateman, A. (2005). Rfam:
annotating non-coding RNAs in complete genomes. Nucleic acids research 33:D121-D124.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick, L.I., Maiti, R., Ronning,
C.M., Rusch, D.B., and Town, C.D. (2003). Improving the Arabidopsis genome annotation
using maximal transcript alignment assemblies. Nucleic acids research 31:5654-5666.
Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., and Wortman,
J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the
Program to Assemble Spliced Alignments. Genome biology 9:R7.
Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty,
L., and Duquenne, L. (2009). InterPro: the integrative protein signature database. Nucleic
acids research 37:D211-D215.
Larkin, M.A., Blackshields, G., Brown, N., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F.,
Wallace, I.M., Wilm, A., and Lopez, R. (2007). Clustal W and Clustal X version 2.0.
Bioinformatics 23:2947-2948.
Larose, D.T. (2005). k‐Nearest Neighbor Algorithm. Discovering Knowledge in Data: An Introduction
to Data Mining:90-106.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv
preprint arXiv:1303.3997.
Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., and Wang, J. (2009a). SNP detection for
massively parallel whole-genome resequencing. Genome research 19:1124-1132.
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). SOAP2: an improved
ultrafast tool for short read alignment. Bioinformatics 25:1966-1967.
Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA
genes in genomic sequence. Nucleic acids research 25:955-964.
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., and Liu, Y. (2012).
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
GigaScience 1:1-6.
Parra, G., Bradnam, K., and Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in
eukaryotic genomes. Bioinformatics 23:1061-1067.
Salmela, L., and Rivals, E. (2014). LoRDEC: accurate and efficient long read error correction.
Bioinformatics:btu538.
She, R., Chu, J.S.-C., Wang, K., Pei, J., and Chen, N. (2009). GenBlastA: enabling BLAST to identify
homologous gene sequences. Genome research 19:143-149.
Simao, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M. (2015). BUSCO:
assessing genome assembly and annotation completeness with single-copy orthologs.
Bioinformatics 31:3210-3212.
Sun, C., Wu, J., Liang, J., Schnable, J.C., Yang, W., Cheng, F., and Wang, X. (2015). Impacts of
Whole-Genome Triplication on MIRNA Evolution in Brassica rapa. Genome biology and
evolution 7:3085-3096.
Sun, X., Liu, D., Zhang, X., Li, W., Liu, H., Hong, W., Jiang, C., Guan, N., Ma, C., and Zeng, H. (2013).
SLAF-seq: an efficient method of large-scale de novo SNP discovery and genotyping using
high-throughput sequencing. PLoS One 8:e58700.
Tarailo‐Graovac, M., and Chen, N. (2009). Using RepeatMasker to identify repetitive elements in
genomic sequences. Current Protocols in Bioinformatics:4.10.1-4.10.14.
Van Ooijen, J. (2006). JoinMap® 4, Software for the calculation of genetic linkage maps in experimental
populations. Kyazma BV, Wageningen 33:10.1371.
Wang, F., Li, L., Li, H., Liu, L., Zhang, Y., Gao, J., and Wang, X. (2012). Transcriptome analysis of rosette
and folding leaves in Chinese cabbage using high-throughput RNA sequencing. Genomics
99:299-307.
Wang, X., and Cheng, F. (2016). Epigenetic regulation of subgenome dominance following whole
genome triplication in Brassica rapa. New Phytologist:1.
Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.-H., Bancroft, I., and Cheng, F.
(2011). The genome of the mesopolyploid crop species Brassica rapa. Nature genetics
43:1035-1039.
Xu, Z., and Wang, H. (2007). LTR_FINDER: an efficient tool for the prediction of full-length LTR
retrotransposons. Nucleic acids research 35:W265-W268.
Yu, X., Wang, H., Zhong, W., Bai, J., Liu, P., and He, Y. (2013). QTL mapping of leafy heads by genome
resequencing in the RIL population of Brassica rapa. PloS one 8:e76059.