Supplemental Information

Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene re-annotation

Chengcheng Cai 1, Xiaobo Wang 1, Bo Liu, Jian Wu, Jianli Liang, Yinan Cui, Feng Cheng *, and Xiaowu Wang *

Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Zhongguancun Nandajie No.12, Haidian district, Beijing 100081, P.R. China.
1 These authors contributed equally to this article.
* Corresponding authors.

Contents
Supplemental Methods
Supplemental Tables
Supplemental Figures
Supplemental References


Supplemental Methods

Genome de novo assembly

A total of 55 Gb (114×) of paired-end read data were generated from two libraries with an insert size of 250 bp and a read length of 150 bp. A further 8.7 Gb (18×) of read data were generated from two mate-paired libraries with insert sizes of 20 Kb and 40 Kb, respectively. All of these data were produced on the Illumina HiSeq 2500 sequencing platform. Furthermore, 6.5 Gb (13.4×) of single-molecule sequencing data (PacBio reads) were generated, with an average read length of 12 Kb. In addition to these newly generated data, all datasets used to assemble B. rapa genome V1.5 (Wang et al., 2011) were also incorporated in this genome re-assembly.

Low-quality reads in the Illumina raw data were filtered out according to the following rules: 1) reads containing “N”; 2) reads whose average Phred-like score was less than 20; 3) reads shorter than 50 bp after trimming the 3′ terminal nucleotides whose Phred-like score was less than 13. Duplicated reads generated from one amplicon were removed to keep only one copy. Meanwhile, sequencing errors in the PacBio reads were corrected with the 150 bp paired-end Illumina reads using LoRDEC (Salmela and Rivals, 2014) with parameters “-k 19 -s 3”. These clean reads were then used for genome assembly.
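The three read-filtering rules and the duplicate removal can be illustrated with a minimal Python sketch. It is not the pipeline actually used; the function names are hypothetical, quality values are assumed to be already decoded to integer Phred scores, and the trimming is treated only as a length test because the text does not say whether the trimmed or the full read was carried forward.

```python
def trim_3prime(quals, min_q=13):
    """Return the read length left after trimming 3' bases with quality < min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return end


def keep_read(seq, quals, min_mean_q=20, trim_q=13, min_len=50):
    """Apply the three filtering rules to one read; True means the read is kept."""
    if not seq or "N" in seq.upper():  # rule 1: read contains "N"
        return False
    if sum(quals) / len(quals) < min_mean_q:  # rule 2: mean quality below 20
        return False
    # rule 3: read is too short once low-quality 3' bases are trimmed
    return trim_3prime(quals, trim_q) >= min_len


def deduplicate(pairs):
    """Keep one copy of each identical (read1, read2) sequence pair."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```

Treating duplicates as read pairs with identical sequences is an assumption made for the sketch; the text only states that duplicated reads from one amplicon were reduced to a single copy.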

Genome assembly proceeded as follows. First, all the 150 bp paired-end reads (114×) were assembled into contigs with SOAPdenovo2 (Luo et al., 2012) using a k-mer size of 91 bp. Next, the large insert-size mate-paired libraries (>2 Kb) used in the previous assembly, plus the two newly generated libraries (20 Kb and 40 Kb), were used to link contigs into scaffolds using SSPACE (Boetzer et al., 2011). Then, publicly available BAC-end sequences were mapped to the scaffolds using BLAST, and the scaffolds were further linked into super-scaffolds according to the paired relationships of the BAC-end sequences. Finally, all the paired-end Illumina reads from short insert-size libraries (<1 Kb) were used to close gaps in these super-scaffolds using GapCloser (Luo et al., 2012) with default parameters. The single-molecule PacBio reads were then used to close further gaps in these super-scaffolds using PBjelly_V15.2.20 (English et al., 2012) with parameters “--minGap 1 -minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500 -nproc 5 -noSplitSubreads”.

The quality of genome assemblies V1.5 and V2.0 was validated through CEGMA analysis (Parra et al., 2007). The 458 core eukaryotic genes (CEG database) were BLASTed against each genome assembly, giving hits for 455 (99.34%) and 454 (99.13%) of the 458 CEG proteins in genome V1.5 and V2.0, respectively. The genomes were also validated by matching B. rapa ESTs downloaded from NCBI, which showed that 99.16% and 99.34% of the ESTs were supported by the assembled genome V1.5 and V2.0, respectively.

Construction of pseudo-chromosomes

To correct mis-assembled scaffolds and assign the corrected scaffolds to the ten chromosomes of B. rapa, two genetic maps were constructed, one for a previously reported RILs (recombinant inbred lines) population (Yu et al., 2013) and one for an F1DH (F1 doubled haploid) population of B. rapa. The RILs population contains 150 recombinant inbred lines derived from a cross between a heading Chinese cabbage (B. rapa ssp. pekinensis cv Bre) and a non-heading B. rapa (B. rapa ssp. chinensis cv Wut) (Yu et al., 2013). Low-depth sequencing data for the RILs population were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). The F1DH population included 120 F1DH lines derived from a cross between a Chinese cabbage DH line Z16 (B. rapa ssp. pekinensis) and a rapid cycling line L144 (B. rapa ssp. oleifera). The F1DH population was previously used to develop a genetic map with InDel markers to construct pseudo-chromosomes for B. rapa genome V1.5 (Wang et al., 2011). In this work, SLAF-Seq (Sun et al., 2013) was performed on the 120 F1DH lines and their two parents.

Two high-density linkage maps were constructed with the following procedure. First, raw data were filtered to remove low-quality reads using the aforementioned rules. Next, clean reads of the two parents of a population were aligned to the B. rapa reference genome (V2.0) using SOAPaligner (SOAP2) (Li et al., 2009b) with parameters “-m 100 -x 1000 -r 0”, and paired-end reads with unique hits to the reference were retained. SOAPsnp (Li et al., 2009a) was then used to call SNPs with parameter “-L 100”. Confident SNPs were selected using the following criteria: 1) the average quality score was over 20; 2) the site was covered by at least three reads; 3) only homozygous genotypes were considered. After that, SNP loci showing polymorphism between the two parents were kept for further analysis. Each line of the population was aligned to the B. rapa genome V2.0 using SOAP2 (Li et al., 2009b) with the same parameters. Genotypes were identified for each line at the SNP loci showing polymorphism between the two parents. Loci that remained ungenotyped were imputed using the k-NN (k nearest neighbors) algorithm (Larose, 2005). The parental origin of each SNP locus was then determined for each line of the population, and genomic regions showing complete linkage disequilibrium across the whole population were merged to form bin-markers. Finally, these bin-markers were submitted to JoinMap (version 4.0) (Van Ooijen, 2006) to construct the genetic map. This process was applied to both the RILs and F1DH populations to build two genetic maps.
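The merging of SNP loci into bin-markers can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes genotypes have already been called and imputed, that each locus carries one letter per line ("A" or "B" for the parent of origin), and that "complete linkage disequilibrium" means consecutive loci share an identical genotype vector. The function name is hypothetical.

```python
def build_bin_markers(loci):
    """Merge consecutive SNP loci whose genotype vectors are identical.

    loci: list of (position, genotypes) sorted by position on one scaffold,
          where genotypes is a string such as "AABBA" (one letter per line).
    Returns a list of (start, end, genotypes) bins.
    """
    bins = []
    for pos, geno in loci:
        if bins and bins[-1][2] == geno:
            start, _, _ = bins[-1]
            bins[-1] = (start, pos, geno)  # no recombination seen: extend the bin
        else:
            bins.append((pos, pos, geno))  # genotype vector changed: start a new bin
    return bins
```

Under these assumptions each bin boundary corresponds to at least one recombination event somewhere in the population, which is what makes the bins usable as markers for map construction.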

A total of 28 mis-assembled scaffolds were corrected with information from the two genetic maps. These 28 scaffolds were split, and their fragments were assigned to chromosomes in the same way as other scaffolds. The RILs genetic map served as the principal evidence (higher density of resequencing) for ordering the assembled scaffolds. The genetic map built from SLAF-Seq of the F1DH population was used to assist the ordering and orientation of scaffolds when there was limited or no recombination information in the RILs map. The syntenic relationships between the physical genomes of B. rapa and A. thaliana or S. parvula were also used as evidence in determining the order and orientation of scaffolds in local regions.

Constructing transcripts from mRNA-Seq data

A total of 29 Gb of mRNA-Seq data were generated from eight tissues of B. rapa: aboveground stem, flower, small anther, middle anther, tender leaf, middle leaf, petiole, and seed pod. Detailed information on each tissue is listed in Supplemental Table 11. In addition to the newly generated data, two previously reported mRNA-Seq datasets were also incorporated. The first dataset was used to verify gene models in the first release of the B. rapa genome (Wang et al., 2011). The second dataset was used in a study of gene expression during leaf development of Chinese cabbage (Wang et al., 2012).

Both de novo and genome-guided approaches were used to assemble these large volumes of mRNA-Seq reads into transcripts. For the de novo approach, the Trinity tool package (Grabherr et al., 2011) was used with default parameters to assemble the mRNA-Seq reads. For the genome-guided approach, several utilities embedded in Trinity (the Perl scripts alignReads.pl, prep_rnaseq_alignments_for_genome_assisted_assembly.pl, GG_write_trinity_cmds.pl and GG_trinity_accession_incrementer.pl, together with ParaFly) were used to complete the alignment and assembly processes step by step.

Genome annotation

The gene prediction process consisted of the following steps: 1) ab initio gene modeling; 2) detection of homologous genes; 3) mapping of transcript fragments; and 4) merging of the three predictions. Repeat sequences were masked before gene annotation. In the first step, two gene predictors, Augustus and GlimmerHMM, were used for de novo gene prediction. From the output of both tools, predicted genes whose coding regions were shorter than 150 bp were filtered out. In the second step, two protein datasets, from A. thaliana (TAIR10) and C. rubella, were collected. genBlastA (She et al., 2009) was used to align the A. thaliana and C. rubella protein sequences to the masked B. rapa genome with parameters “-e 1e-5”. The candidate gene sequences generated by genBlastA, along with the homologous proteins, were further processed by GeneWise (Birney et al., 2004) with default parameters to predict gene structures. For the third step, three kinds of evidence were used to perform transcript-assisted gene prediction: 1) a set of 214,482 B. rapa ESTs downloaded from NCBI; 2) de novo assembled transcripts; and 3) genome-guided assembled transcripts. These transcript-related sequences were aligned to the masked genome and the results were processed by PASA (Haas et al., 2003) with default parameters.

In the final step, EVidenceModeler (EVM) (Haas et al., 2008) was used to merge all results from de novo prediction, homologous prediction, and PASA annotation into a weighted consensus gene set (weight values: 1 for de novo prediction, 5 for homologous prediction, and 10 for PASA annotation). The EVM results were further filtered to remove: 1) genes with protein-coding regions shorter than 150 bp; 2) incomplete genes without start and stop codons; and 3) genes with premature termination in translation. Finally, PASA was used to improve the EVM gene models by modifying exons, adding UTRs, and determining alternatively spliced isoforms. The predicted gene sets V1.5 and V2.0 were further assessed and compared using BUSCO analysis (Simao et al., 2015) (Supplemental Table 12).
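The three post-EVM filters can be expressed as a short check on a coding sequence. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, a "complete" model is assumed to begin with ATG, end with a standard stop codon and have a length divisible by three, and only the standard nuclear stop codons are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_filters(cds, min_len=150):
    """Return True if a coding sequence survives the three filters."""
    cds = cds.upper()
    if len(cds) < min_len:  # filter 1: coding region shorter than 150 bp
        return False
    if len(cds) % 3 != 0:  # assumption: a complete CDS is a whole number of codons
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # filter 2: incomplete model (assumed: missing start codon or stop codon)
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # filter 3: premature termination (in-frame stop before the final codon)
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

Whether a model lacking only one of the two terminal codons was discarded is not stated in the text; the sketch takes the stricter reading and rejects it.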

After gene model prediction, gene function annotation was performed. InterProScan (Hunter et al., 2009) was used to annotate motifs and domains by comparing predicted genes with available databases such as PROSITE, PRINTS, Pfam, ProDom, and SMART. The GO annotation was extracted from the output of InterProScan. The predicted proteins were further aligned to the Swiss-Prot, TrEMBL, and KEGG databases using BLASTP with an E-value of 1×10⁻⁵ to obtain other annotation information. All four of these annotation datasets are freely available at: http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/.

Annotation of transposable elements and non-coding RNAs

Transposable elements (TEs) were annotated using RepeatMasker (Tarailo-Graovac and Chen, 2009). In order to compare features of TEs between B. rapa genome V1.5 and V2.0, TEs were re-annotated on both genome versions using the same methods described elsewhere (Wang and Cheng, 2016). For the newly annotated TEs in V2.0, two situations were observed: 1) extra TEs in V2.0 without counterparts in the corresponding regions of V1.5 (Supplemental Figure 7A); and 2) more TEs annotated in V2.0 than in V1.5 in corresponding regions (Supplemental Figure 7B). Full-length LTRs were identified using LTR_Finder with default parameters (Xu and Wang, 2007).

According to the structural characteristics of tRNA, tRNAscan-SE-1.23 (Lowe and Eddy, 1997) was used to identify tRNA sequences. rRNAs were located by aligning known full-length plant rRNAs to the B. rapa genome. snRNA sequences were predicted using Rfam-9.1 (Griffiths-Jones et al., 2005). miRNAs were predicted using methods similar to those reported by Sun et al. (2015). In total, the annotated ncRNAs of different kinds accounted for ≈0.306% of the updated genome (Supplemental Table 13).

Sub-genomes reconstruction and analysis

Based on the syntenic relationship between B. rapa genome V2.0 and A. thaliana, the least fractionated (LF), the medium fractionated (MF1) and the most fractionated (MF2) sub-genomes of B. rapa V2.0 were built. Detailed information on paralogous genes in the three sub-genomes is listed in Supplemental Table 14. The LF sub-genome retained more gene copies than the MF1 and MF2 sub-genomes. Part of block D was found to be lost, which resulted in low gene density in the LF sub-genome at the beginning of ancestral chromosome tPCK7 (Supplemental Figure 8).

A total of 2,058 genes fully retained in all three sub-genomes were identified. After removing TA (tandem) genes and MT genes (one gene in B. rapa genome V1.5 or in A. thaliana, but multiple homologous genes in V2.0), genes from 1,307 fully retained homoeologs were chosen to analyze patterns of gene expression among sub-genomes using the same methods described elsewhere (Feng et al., 2012). Genes in sub-genome LF are dominantly expressed over genes in MF1 and MF2 (Supplemental Table 15).

The relationship between TEs targeted by 24-nucleotide small RNAs and the expression pattern of gene doublets was analyzed using methods similar to those reported previously (Cheng et al., 2016). The mRNA-Seq data of three organs (root, leaf and stem) and the small RNA data for the leaf used in this study were retrieved from http://brassicadb.org/brad/datasets/pub/srna2/. For the sub-genome pairs LF and MF1, and LF and MF2, small RNA targeted TEs (RNA+ TEs) showed a negative association with the expression levels of gene doublets; namely, dominantly expressed genes from a sub-genome showed a low level of RNA+ TEs in the 2 kb of 5′ UTR regions (Supplemental Figure 9).

Comparison of gene annotation between V1.5 and V2.0

Genome synteny analysis was performed between the two versions using SynOrths (Cheng et al., 2012) to determine corresponding gene pairs (i.e. the same gene loci in the two assemblies) and tandem gene arrays. Based on the results reported by SynOrths, relationships of tandem gene arrays in the two genome versions were classified into three categories: 1) tandem arrays in one version are syntenic to tandem arrays in the other version; 2) tandem arrays in one version correspond to non-tandem arrays in the other version; 3) tandem arrays in one version cannot be mapped to any gene in the other version. Non-tandem genes in the two predictions were classified into four categories: 1) one gene in V1.5 is the counterpart of one gene in V2.0 (one-to-one gene pairs); 2) one gene in V1.5 corresponds to several genes (≥2) in V2.0 (one-to-multiple genes); 3) several genes (≥2) in V1.5 correspond to one gene in V2.0 (multiple-to-one genes) (Supplemental Figure 4B); and 4) genes specific to each annotation (genes with no counterparts in the other version).
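The four categories of non-tandem genes can be derived mechanically from a list of corresponding gene pairs. The sketch below is a simplified illustration, not SynOrths itself: the function name and category labels are hypothetical, gene IDs are assumed to be distinct between the two versions, and a gene caught in a many-to-many relationship is simply labelled by its own number of partners.

```python
from collections import defaultdict


def classify_genes(pairs, v15_genes, v20_genes):
    """Classify non-tandem genes by how they correspond between versions.

    pairs: iterable of (v15_gene, v20_gene) syntenic correspondences.
    Returns a dict mapping each gene ID to one of the four categories.
    """
    v15_to_v20 = defaultdict(set)
    v20_to_v15 = defaultdict(set)
    for a, b in pairs:
        v15_to_v20[a].add(b)
        v20_to_v15[b].add(a)

    category = {}
    for gene in v15_genes:
        partners = v15_to_v20.get(gene, set())
        if not partners:
            category[gene] = "version-specific"
        elif len(partners) >= 2:
            category[gene] = "one-to-multiple"
        elif len(v20_to_v15[next(iter(partners))]) >= 2:
            category[gene] = "multiple-to-one"
        else:
            category[gene] = "one-to-one"
    for gene in v20_genes:
        partners = v20_to_v15.get(gene, set())
        if not partners:
            category[gene] = "version-specific"
        elif len(partners) >= 2:
            category[gene] = "multiple-to-one"
        elif len(v15_to_v20[next(iter(partners))]) >= 2:
            category[gene] = "one-to-multiple"
        else:
            category[gene] = "one-to-one"
    return category
```

The category names are always read in the V1.5-to-V2.0 direction, so a V2.0 gene that is one of several counterparts of a single V1.5 gene is labelled "one-to-multiple".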

To further analyze the differences in genes between these two predictions, a detailed analysis was performed for each of the above four gene categories. The protein sequences of one-to-one gene pairs were compared using ClustalX (Larkin et al., 2007) to assess variations in gene coding sequences. For one-to-multiple and multiple-to-one genes, mRNA-Seq reads were mapped to the CDSs of these gene pairs using BWA (Li, 2013) to find evidence on whether these genes were correctly annotated. The analysis focused on paired-end reads with one read mapped to one of the multiple genes and the other read mapped to another of the multiple genes; such a read pair indicates that the two genes should be re-annotated as one gene. For the version-specific genes (those with no counterparts), the following two analyses were performed to assess their reliability: 1) mRNA-Seq reads were mapped to the CDSs of these version-specific genes using BWA; 2) their proteins were BLASTed (BLASTP) against proteins of five other Brassica species, with an E-value of 1×10⁻⁵ (Supplemental Table 16). Mapping of mRNA-Seq reads to the CDSs of these genes showed that 3,470 genes in V1.5 and 6,350 genes in V2.0 had mRNA-Seq evidence supporting their reliability. In addition, 3,244 genes in V1.5 and 6,687 genes in V2.0 had BLASTP hits to proteins of other Brassica species (Supplemental Table 16).
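The paired-end evidence used for one-to-multiple and multiple-to-one genes amounts to counting read pairs whose two mates land on different genes. A minimal sketch of that count is given here, assuming the alignments have already been reduced to one gene ID per mate (with None for an unmapped mate); the function name is hypothetical and no threshold is implied, since the text does not give one.

```python
from collections import Counter


def bridging_pairs(mate_hits):
    """Count read pairs whose two mates map to different genes.

    mate_hits: iterable of (gene_of_read1, gene_of_read2); None marks an unmapped mate.
    Returns a Counter keyed by the sorted pair of gene IDs.
    """
    counts = Counter()
    for g1, g2 in mate_hits:
        if g1 is not None and g2 is not None and g1 != g2:
            counts[tuple(sorted((g1, g2)))] += 1
    return counts
```

Gene pairs with many bridging read pairs would be the candidates for re-annotation as a single gene.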

GO term enrichment analysis

For each GO term, its occurrences in the 3-copy gene set and in the combined 1- and 2-copy gene set of V2.0 were counted. Its occurrences in the 1-copy gene set and in the combined 2- and 3-copy gene set were also counted, as reported previously (Wang et al., 2011). Fisher’s exact tests “1+2 vs 3” and “1 vs 2+3” were performed to test for the over-retention of 3-copy and the under-retention of 1-copy orthologous GO terms, respectively. The results showed that genes encoding subunits of proteasomes, ribosomes, and transcription factor complexes were over-retained (Supplemental Table 17), while genes associated with DNA repair, binding and chloroplast were under-retained (Supplemental Table 18). These findings were in accordance with the gene balance hypothesis, which predicts that genes whose products interact with other gene products are more likely to be over-retained, and that otherwise they are more likely to be under-retained (Freeling, 2009). Genes associated with response to environmental factors and plant hormones were also over-retained (Supplemental Table 17).
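For a single GO term, the “1+2 vs 3” comparison reduces to a 2×2 table: genes with and without the term, in the 3-copy set and in the combined 1- and 2-copy set. A one-sided Fisher's exact test on such a table can be sketched in plain Python as below. The function name is hypothetical, and the choice of a one-sided test is an assumption; the text does not state whether one- or two-sided p-values were used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: genes with the GO term in the first set, b: genes without it;
    c, d: the same counts for the second set.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1)) / denom
```

A small p-value for the table built from the 3-copy set suggests over-retention of that term; swapping the rows tests the opposite direction.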

Using a similar method, GO enrichment of tandem gene arrays in B. rapa genome V2.0 was also analyzed. The occurrences of each GO term in the tandem arrays and in non-tandem genes were counted. Fisher’s exact test was performed to test whether a GO term was enriched in the tandem gene arrays. The results show that genes related to defense response, membrane functions, and different kinds of enzyme activity were enriched in tandem gene arrays (Supplemental Table 9).

Accession numbers

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession AENI00000000. The version described in this paper is version AENI02000000. The genome assembly and gene annotation results are also freely available through the BRAD website (Brassica database, http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/).

Supplemental Tables

Supplemental Table 1. Summary of Illumina sequencing data used in the assembly of B. rapa genome V2.0.

Insert Size (bp) | Read Length (bp) | Raw Data (bp) | Clean Data (bp) | Sequence Depth (X)
250* | 150 | 73,219,780,500 | 55,479,375,000 | 114.39
20,000* | 44/49/90 | 18,903,314,720 | 5,638,004,562 | 11.62
40,000* | 90 | 16,801,802,820 | 3,105,338,220 | 6.4
PacBio* | / | 6,473,983,398 | 6,412,500,309** | 13.22
200 | 44/75/100 | 19,264,162,762 | 13,697,925,048 | 28.24
500 | 44/75 | 7,807,414,874 | 6,407,407,132 | 13.21
2,000 | 44/75 | 3,581,696,974 | 3,322,057,634 | 6.85
5,000 | 44 | 3,212,095,304 | 2,297,055,728 | 4.74
8,000 | 44 | 2,469,275,072 | 1,255,661,264 | 2.59
10,000 | 44 | 6,106,076,328 | 4,530,229,440 | 9.34
14,000 | 44 | 2,066,000,464 | 771,807,344 | 1.59
BAC-end | / | 154,631,945 | / | /

*: newly generated in this project;
**: corrected previous Illumina reads.

Supplemental Table 2. Comparisons of statistics between B. rapa genome assembly V1.5 and V2.0.

The first four data columns refer to V1.5; the last four refer to V2.0 (Scaffold Refined with PacBio).

Statistic | V1.5 Contig Size (bp) | V1.5 Number of Contigs | V1.5 Scaffold Size (bp) | V1.5 Number of Scaffolds | V2.0 Contig Size (bp) | V2.0 Number of Contigs | V2.0 Scaffold Size (bp) | V2.0 Number of Scaffolds
N90 | 9,900 | 6,351 | 308,587 | 180 | 5,939 | 8,456 | 25,622 | 349
N80 | 19,011 | 4,395 | 635,699 | 118 | 19,181 | 5,373 | 975,164 | 89
N70 | 27,303 | 3,202 | 1,100,015 | 85 | 30,559 | 3,878 | 1,852,239 | 62
N60 | 36,355 | 2,334 | 1,372,560 | 62 | 41,205 | 2,850 | 2,610,373 | 44
N50 | 46,088 | 1,663 | 1,846,652 | 44 | 52,684 | 2,063 | 3,377,735 | 31
Total Size | 273,100,332 | | 283,810,373 | | 366,413,862 | | 389,189,875 |
Total Number (>=100 bp) | | 51,647 | | 40,576 | | 96,883 | | 86,986
Total Number (>=2 kb) | | 9,553 | | 821 | | 10,673 | | 2,178

Supplemental Table 3. Comparisons of statistics on B. rapa genome assembly V2.0 before and after using PacBio data.

Statistic | Contig (before using PacBio Data) Size (bp) | Contig (before using PacBio Data) Number | Scaffold (before using PacBio Data) Size (bp) | Scaffold (before using PacBio Data) Number | Scaffold Refined with PacBio Data Size (bp) | Scaffold Refined with PacBio Data Number
N90 | 2,566 | 14,739 | 40,173 | 354 | 25,622 | 349
N80 | 10,065 | 8,939 | 784,999 | 97 | 975,164 | 89
N70 | 16,336 | 6,370 | 1,636,670 | 66 | 1,852,239 | 62
N60 | 22,619 | 4,640 | 2,289,139 | 47 | 2,610,373 | 44
N50 | 29,362 | 3,348 | 2,973,276 | 33 | 3,377,735 | 31
Total size | 333,541,035 | | 370,219,313 | | 389,189,875 |
Total Number (>100bp) | | 104,563 | | 87,323 | | 86,986
Total Number (>2kb) | | 15,785 | | 1,697 | | 2,178
Gaps | | | 36,678,278 | | 22,776,013 |
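The N50 to N90 rows in Supplemental Tables 2 and 3 pair a length with a sequence count. Assuming the conventional definition (the text does not spell it out), Nxx is the length of the shortest sequence in the smallest set of longest sequences that covers at least xx% of the total assembly size, and the count is the size of that set. A minimal Python sketch with a hypothetical function name:

```python
def nxx(lengths, fraction=0.5):
    """Return (Nxx length, number of sequences needed) for a set of lengths.

    With fraction=0.5 this is N50: the length L such that sequences of
    length >= L together cover at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty and positive")
```

For example, `nxx([2, 3, 4, 5, 6, 7, 8, 9, 10])` gives `(8, 3)`: the three longest sequences (10, 9 and 8) sum to 27, half of the total of 54.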

Supplemental Table 4. Statistics of two genetic maps and assignment of assembly V2.0 to ten chromosomes of B. rapa with the two maps.

Pseudo-chromosome | RILs No. of Binmarkers | RILs Position (cM) | RILs No. of Linked Scaffolds | RILs Length (Mb) | DH No. of Binmarkers | DH Position (cM) | DH No. of Linked Scaffolds | DH Length (Mb)
A01 | 196 | 162.14 | 29 | 33.7 | 133 | 128.911 | 27 | 33.3
A02 | 172 | 133.611 | 15 | 30.4 | 109 | 140.507 | 14 | 30.0
A03 | 228 | 178.902 | 5 | 36.4 | 111 | 170.588 | 7 | 39.3
A04 | 108 | 123.855 | 14 | 23.4 | 85 | 95.46 | 14 | 23.4
A05 | 214 | 117.505 | 30 | 36.0 | 150 | 153.926 | 25 | 35.5
A06 | 256 | 115.225 | 25 | 39.7 | 137 | 137.036 | 18 | 37.3
A07 | 149 | 123.353 | 10 | 29.7 | 117 | 148.978 | 11 | 34.8
A08 | 116 | 115.619 | 15 | 27.7 | 111 | 127.699 | 14 | 27.4
A09 | 264 | 136.684 | 31 | 54.4 | 143 | 188.759 | 29 | 53.9
A10 | 102 | 109.837 | 6 | 18.5 | 71 | 99.652 | 9 | 21.2
Total | 1,805 | 1,316.731 | 146 | 329.9 | 1,167 | 1,391.516 | 138 | 336.1

Supplemental Table 5. List of mis-assembled scaffolds and their splitting information.

Chr ID | Sca ID | Sca split ID | Start | End | Order | Supporting RILs Binmarkers
A05 | Scaffold000001 | Scaffold000001_1 | 1 | 1568174 | + | 12 BinMarkers
A09 | Scaffold000001 | Scaffold000001_2 | 1568175 | 2839151 | + | 1 BinMarkers
A09 | Scaffold000001 | Scaffold000001_3 | 2839152 | 6711068 | - | 8 BinMarkers
A01 | Scaffold000001 | Scaffold000001_4 | 6711069 | 7145794 | + | 3 BinMarkers
A03 | Scaffold000001 | Scaffold000001_5 | 7145795 | 16627912 | + | 50 BinMarkers
A07 | Scaffold000003 | Scaffold000003_1 | 1 | 4706983 | + | 18 BinMarkers
A09 | Scaffold000003 | Scaffold000003_2 | 4706984 | 9295739 | + | 8 BinMarkers
A05 | Scaffold000003 | Scaffold000003_3 | 9295740 | 10101789 | - | 4 BinMarkers
A06 | Scaffold000003 | Scaffold000003_4 | 10101790 | 11093742 | + |
A10 | Scaffold000005 | Scaffold000005_1 | 1499897 | 8038702 | ? | 52 BinMarkers
A08 | Scaffold000005 | Scaffold000005_2 | 1 | 1499896 | ? | 4 BinMarkers
A01 | Scaffold000006 | Scaffold000006_1 | 1 | 879152 | + | 5 BinMarkers
A07 | Scaffold000006 | Scaffold000006_2 | 879153 | 4858235 | + | 18 BinMarkers
A02 | Scaffold000006 | Scaffold000006_3 | 4858236 | 6679097 | + | 14 BinMarkers
A05 | Scaffold000006 | Scaffold000006_4 | 6679123 | 7844833 | + |
A01 | Scaffold000007 | Scaffold000007_1 | 1 | 2257556 | + | 5 BinMarkers
A09 | Scaffold000007 | Scaffold000007_2 | 2257557 | 5817461 | - | 9 BinMarkers
A08 | Scaffold000007 | Scaffold000007_3 | 5817462 | 7741191 | ? | 1 BinMarkers
A06 | Scaffold000008 | Scaffold000008_1 | 1 | 784055 | - | 5 BinMarkers
A03 | Scaffold000008 | Scaffold000008_2 | 784056 | 6702624 | - | 19 BinMarkers
A06 | Scaffold000008 | Scaffold000008_3 | 6702625 | 7319999 | + | 5 BinMarkers
A05 | Scaffold000009 | Scaffold000009_1 | 5502703 | 7126136 | ? | 14 BinMarkers
A02 | Scaffold000009 | Scaffold000009_2 | 1801030 | 5502702 | ? | 20 BinMarkers
A09 | Scaffold000009 | Scaffold000009_3 | 1 | 1801029 | ? | 5 BinMarkers
A07 | Scaffold000010 | Scaffold000010_1 | 1 | 2672674 | + | 17 BinMarkers
A01 | Scaffold000010 | Scaffold000010_2 | 2672675 | 6707728 | + | 11 BinMarkers
A02 | Scaffold000012 | Scaffold000012_1 | 3977006 | 6361331 | ? | 18 BinMarkers
A09 | Scaffold000012 | Scaffold000012_2 | 1 | 3977005 | ? | 10 BinMarkers
A04 | Scaffold000014 | Scaffold000014_1 | 1 | 1582907 | + | 2 BinMarkers
A05 | Scaffold000014 | Scaffold000014_2 | 1582908 | 3506367 | + | 9 BinMarkers
A06 | Scaffold000014 | Scaffold000014_3 | 3506368 | 5718963 | + | 13 BinMarkers
A09 | Scaffold000016 | Scaffold000016_1 | 1 | 4330173 | - | 6 BinMarkers
A06 | Scaffold000016 | Scaffold000016_2 | 4330174 | 5621156 | + | 5 BinMarkers
A01 | Scaffold000022 | Scaffold000022_1 | 3811411 | 4302067 | ? | 3 BinMarkers
A08 | Scaffold000022 | Scaffold000022_2 | 1 | 3811410 | ? | 11 BinMarkers
A07 | Scaffold000029 | Scaffold000029_1 | 2332209 | 4120263 | + | 4 BinMarkers
A02 | Scaffold000029 | Scaffold000029_2 | 1 | 2332208 | + | 3 BinMarkers
A01 | Scaffold000030 | Scaffold000030_1 | 1 | 1708633 | - | 8 BinMarkers
A02 | Scaffold000030 | Scaffold000030_2 | 1708634 | 3289170 | - | 8 BinMarkers
A05 | Scaffold000031 | Scaffold000031_1 | 1 | 932619 | + | 8 BinMarkers
A08 | Scaffold000031 | Scaffold000031_2 | 932620 | 3246927 | - | 17 BinMarkers
A06 | Scaffold000034 | Scaffold000034_1 | 1 | 3966651 | + | 2 BinMarkers
A05 | Scaffold000034 | Scaffold000034_2 | 3966652 | 4925386 | + | 7 BinMarkers
A04 | Scaffold000040 | Scaffold000040_1 | 1 | 1017356 | + | 2 BinMarkers
A08 | Scaffold000040 | Scaffold000040_2 | 1017357 | 2812160 | ? | 1 BinMarkers
A07 | Scaffold000042 | Scaffold000042_2 | 1249768 | 2610373 | - | 2 BinMarkers
A06 | Scaffold000047 | Scaffold000047_1 | 1 | 1339929 | - | 10 BinMarkers
A09 | Scaffold000047 | Scaffold000047_2 | 1339930 | 2395810 | - | 5 BinMarkers
A05 | Scaffold000048 | Scaffold000048_1 | 1 | 2314489 | - | 14 BinMarkers
A04 | Scaffold000048 | Scaffold000048_2 | 2314490 | 2394815 | ? | 1 BinMarkers
A05 | Scaffold000051 | Scaffold000051_1 | 1 | 194542 | - | 2 BinMarkers
A04 | Scaffold000051 | Scaffold000051_2 | 194543 | 2129242 | + | 12 BinMarkers
A05 | Scaffold000059 | Scaffold000059_1 | 1 | 1942021 | + |
A07 | Scaffold000060 | Scaffold000060_1 | 472477 | 1852239 | ? | 6 BinMarkers
A09 | Scaffold000060 | Scaffold000060_2 | 1 | 472476 | ? | 3 BinMarkers
A01 | Scaffold000064 | Scaffold000064_1 | 1 | 413784 | - |
A05 | Scaffold000064 | Scaffold000064_2 | 413785 | 1779331 | + | 5 BinMarkers
A08 | Scaffold000069 | Scaffold000069_1 | 1 | 994093 | - | 8 BinMarkers
A01 | Scaffold000069 | Scaffold000069_2 | 994094 | 1677521 | - | 4 BinMarkers
A05 | Scaffold000091 | Scaffold000091_1 | 1 | 370869 | - |
A06 | Scaffold000091 | Scaffold000091_2 | 370870 | 1095044 | - |
A05 | Scaffold000103 | Scaffold000103_1 | 469247 | 824434 | ? | 2 BinMarkers
A09 | Scaffold000103 | Scaffold000103_2 | 1 | 469246 | ? |
A01 | Scaffold000108 | Scaffold000108_1 | 1 | 452712 | - | 4 BinMarkers
A10 | Scaffold000108 | Scaffold000108_2 | 452713 | 661024 | - |

Page 18: 1 Supplemental Information 2 3 Brassica rapa genome 2.0: a ... · 1 1 Supplemental Information 2 3 Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly 4 and

18

Supplemental Table 6. Detailed information about the order and orientation of scaffolds on ten chromosomes of B. rapa.

Columns: Group ID; Chr ID; Split Sca Order; Sca Order; Start; End; Order; Split Sca; Z16XL144DH Linkage Map; RILs Linkage Map

G01 A01 Scaffold000019 Scaffold000019 1 4924075 - support
G01 A01 Scaffold000022_1 Scaffold000022 3811411 4302067 ? Split support 3 BinMarkers
G01 A01 Scaffold001067 Scaffold001067 1 6655 ? no support
G01 A01 Scaffold000081 Scaffold000081 1 1248719 - support
G01 A01 Scaffold000025 Scaffold000025 1 3737062 + support
G01 A01 Scaffold000108_1 Scaffold000108 1 452712 - Split support 4 BinMarkers
G01 A01 Scaffold000077 Scaffold000077 1 1442098 - support
G01 A01 Scaffold000084 Scaffold000084 1 1146124 + support
G01 A01 Scaffold000007_1 Scaffold000007 1 2257556 + Split support 5 BinMarkers
G01 A01 Scaffold000110 Scaffold000110 1 625704 - support
G01 A01 Scaffold000058 Scaffold000058 1 1903670 + support
G01 A01 Scaffold000030_1 Scaffold000030 1 1708633 - Split support 8 BinMarkers
G01 A01 Scaffold000064_1 Scaffold000064 1 413784 - Split no support 4 BinMarkers
G01 A01 Scaffold000069_2 Scaffold000069 994094 1677521 - Split support 4 BinMarkers
G01 A01 Scaffold000079 Scaffold000079 1 1429233 + support
G01 A01 Scaffold000010_2 Scaffold000010 2672675 6707728 + Split support 11 BinMarkers
G01 A01 Scaffold000001_4 Scaffold000001 6711069 7145794 + Split support 3 BinMarkers
G01 A01 Scaffold000006_1 Scaffold000006 1 879152 + Split support 5 BinMarkers
G01 A01 Scaffold000080 Scaffold000080 1 1365531 + support
G01 A01 Scaffold000097 Scaffold000097 1 839554 + support


G01 A01 Scaffold000128 Scaffold000128 1 404483 - support
G01 A01 Scaffold000116 Scaffold000116 1 540442 - support
G01 A01 Scaffold000135 Scaffold000135 1 311370 - support
G01 A01 Scaffold000104 Scaffold000104 1 785330 - support
G01 A01 Scaffold000138 Scaffold000138 1 312602 + support
G01 A01 Scaffold000155 Scaffold000155 1 209444 + support
G01 A01 Scaffold000147 Scaffold000147 1 227696 - support
G01 A01 Scaffold000151 Scaffold000151 1 191664 + support
G01 A01 Scaffold000106 Scaffold000106 1 738834 + support
G02 A02 Scaffold000068 Scaffold000068 1 1623172 - support
G02 A02 Scaffold000055 Scaffold000055 1 2086791 + support
G02 A02 Scaffold000113 Scaffold000113 1 565333 - support
G02 A02 Scaffold000044 Scaffold000044 1 3377735 + support
G02 A02 Scaffold000041 Scaffold000041 1 2704147 + support
G02 A02 Scaffold000123 Scaffold000123 1 406925 + support
G02 A02 Scaffold000009_2 Scaffold000009 1801030 5502702 ? Split support 20 BinMarkers
G02 A02 Scaffold000062 Scaffold000062 1 1818794 + support
G02 A02 Scaffold000012_1 Scaffold000012 3977006 6361331 ? Split support 18 BinMarkers
G02 A02 Scaffold000083 Scaffold000083 1 1181007 - support
G02 A02 Scaffold000006_3 Scaffold000006 4858236 6679097 + Split support 14 BinMarkers
G02 A02 Scaffold000130 Scaffold000130 1 415228 - no support
G02 A02 Scaffold000029_2 Scaffold000029 1 2332208 + Split support 3 BinMarkers
G02 A02 Scaffold000030_2 Scaffold000030 1708634 3289170 - Split support 8 BinMarkers
G02 A02 Scaffold000021 Scaffold000021 1 4367232 - support
G03 A07 Scaffold000072 Scaffold000072 1 1602627 - support
G03 A07 Scaffold000060_1 Scaffold000060 472477 1852239 ? Split support 6 BinMarkers


G03 A07 Scaffold000029_1 Scaffold000029 2332209 4120263 + Split support 4 BinMarkers
G03 A07 Scaffold000003_1 Scaffold000003 1 4706983 + Split support 18 BinMarkers
G03 A07 Scaffold000045 Scaffold000045 1 2590264 - support
G03 A07 Scaffold000006_2 Scaffold000006 879153 4858235 + Split support 18 BinMarkers
G03 A07 Scaffold000042_2 Scaffold000042 1249768 2610373 - Split support 2 BinMarkers
G03 A07 Scaffold000010_1 Scaffold000010 1 2672674 + Split support 17 BinMarkers
G03 A07 Scaffold000017 Scaffold000017 1 5123980 - support
G03 A07 Scaffold000020 Scaffold000020 1 4515445 - support
G04 A09 Scaffold000011 Scaffold000011 1 6530455 - support
G04 A09 Scaffold000060_2 Scaffold000060 1 472476 ? Split support 3 BinMarkers
G04 A09 Scaffold000153 Scaffold000153 1 199367 - no support
G04 A09 Scaffold000085 Scaffold000085 1 1138684 + support
G04 A09 Scaffold000131 Scaffold000131 1 386326 + support
G04 A09 Scaffold000105 Scaffold000105 1 763457 + support
G04 A09 Scaffold000063 Scaffold000063 1 1763648 - support
G04 A09 Scaffold000125 Scaffold000125 1 413779 - support
G04 A09 Scaffold000036 Scaffold000036 1 2972422 + support
G04 A09 Scaffold000001_3 Scaffold000001 2839152 6711068 - Split support 8 BinMarkers
G04 A09 Scaffold000109 Scaffold000109 1 644992 + support
G04 A09 Scaffold000506 Scaffold000506 1 21420 ? no support
G04 A09 Scaffold000033 Scaffold000033 1 3178030 - support
G04 A09 Scaffold000012_2 Scaffold000012 1 3977005 ? Split support 10 BinMarkers
G04 A09 Scaffold000001_2 Scaffold000001 1568175 2839151 + Split support 1 BinMarkers
G04 A09 Scaffold000016_1 Scaffold000016 1 4330173 - Split support 6 BinMarkers
G04 A09 Scaffold000073 Scaffold000073 1 1599488 + support
G04 A09 Scaffold000003_2 Scaffold000003 4706984 9295739 + Split support 8 BinMarkers


G04 A09 Scaffold000092 Scaffold000092 1 920634 - support
G04 A09 Scaffold000103_2 Scaffold000103 1 469246 ? Split no support
G04 A09 Scaffold000112 Scaffold000112 1 611878 - support
G04 A09 Scaffold000094 Scaffold000094 1 882339 + support
G04 A09 Scaffold000009_3 Scaffold000009 1 1801029 ? Split support 5 BinMarkers
G04 A09 Scaffold000061 Scaffold000061 1 1921111 + support
G04 A09 Scaffold000007_2 Scaffold000007 2257557 5817461 - Split support 9 BinMarkers
G04 A09 Scaffold000621 Scaffold000621 1 13908 ? no support
G04 A09 Scaffold000047_2 Scaffold000047 1339930 2395810 - Split support 5 BinMarkers
G04 A09 Scaffold000050 Scaffold000050 1 2386279 - support
G04 A09 Scaffold000533 Scaffold000533 1 13461 ? no support
G04 A09 Scaffold000493 Scaffold000493 1 63376 ? no support
G04 A09 Scaffold000071 Scaffold000071 1 2554287 - support
G04 A09 Scaffold000505 Scaffold000505 1 15192 ? no support
G05 A05 Scaffold000103_1 Scaffold000103 469247 824434 ? Split support 2 BinMarkers
G05 A05 Scaffold000120 Scaffold000120 1 482396 ? support
G05 A05 Scaffold000152 Scaffold000152 1 186206 ? no support
G05 A05 Scaffold000091_1 Scaffold000091 1 370869 - Split no support 2 BinMarkers
G05 A05 Scaffold000014_2 Scaffold000014 1582908 3506367 + Split support 9 BinMarkers
G05 A05 Scaffold000048_1 Scaffold000048 1 2314489 - Split support 14 BinMarkers
G05 A05 Scaffold000051_1 Scaffold000051 1 194542 - Split support 2 BinMarkers
G05 A05 Scaffold000031_1 Scaffold000031 1 932619 + Split support 8 BinMarkers
G05 A05 Scaffold000001_1 Scaffold000001 1 1568174 + Split support 12 BinMarkers
G05 A05 Scaffold000059_1 Scaffold000059 1 1942021 + support
G05 A05 Scaffold000034_2 Scaffold000034 3966652 4925386 + Split support 7 BinMarkers
G05 A05 Scaffold000064_2 Scaffold000064 413785 1779331 + Split support 5 BinMarkers


G05 A05 Scaffold000511 Scaffold000511 1 23219 ? no support
G05 A05 Scaffold000042 Scaffold000042 1 2610373 + Split support 7 BinMarkers
G05 A05 Scaffold000133 Scaffold000133 1 369630 ? support
G05 A05 Scaffold000076 Scaffold000076 1 1504109 + support
G05 A05 Scaffold000006_4 Scaffold000006 6679123 7844833 + Split no support 8 BinMarkers
G05 A05 Scaffold000038 Scaffold000038 1 2985750 + support
G05 A05 Scaffold000132 Scaffold000132 1 382277 + support
G05 A05 Scaffold000003_3 Scaffold000003 9295740 10101789 - Split support 4 BinMarkers
G05 A05 Scaffold000143 Scaffold000143 1 260658 + no support
G05 A05 Scaffold000009_1 Scaffold000009 5502703 7126136 ? Split support 14 BinMarkers
G05 A05 Scaffold000024 Scaffold000024 1 4645629 - support
G05 A05 Scaffold000129 Scaffold000129 1 400740 + support
G05 A05 Scaffold000160 Scaffold000160 1 174875 + no support
G05 A05 Scaffold000119 Scaffold000119 1 494856 - no support
G05 A05 Scaffold000183 Scaffold000183 1 90881 ? no support
G05 A05 Scaffold000127 Scaffold000127 1 399023 - support
G05 A05 Scaffold000027 Scaffold000027 1 3437377 - support
G05 A05 Scaffold000056 Scaffold000056 1 2001222 - support
G06 A10 Scaffold000005_1 Scaffold000005 1499897 8038702 ? Split support 52 BinMarkers
G06 A10 Scaffold000015 Scaffold000015 1 5489197 + support 27 BinMarkers
G06 A10 Scaffold000108_2 Scaffold000108 452713 661024 - Split no support 2 BinMarkers
G06 A10 Scaffold000075 Scaffold000075 1 1477010 - support 2 BinMarkers
G06 A10 Scaffold000057 Scaffold000057 1 1972711 - support 9 BinMarkers
G06 A10 Scaffold000037 Scaffold000037 1 2850418 + support 10 BinMarkers
G07 A03 Scaffold000013 Scaffold000013 1 6133134 + support 44 BinMarkers
G07 A03 Scaffold000070 Scaffold000070 1 1618636 + support 13 BinMarkers


G07 A03 Scaffold000002 Scaffold000002 1 13282552 + support 102 BinMarkers
G07 A03 Scaffold000001_5 Scaffold000001 7145795 16627912 + Split support 50 BinMarkers
G07 A03 Scaffold000008_2 Scaffold000008 784056 6702624 - Split support 19 BinMarkers
G08 A06 Scaffold000393 Scaffold000393 1 39066 ? no support
G08 A06 Scaffold001083 Scaffold001083 1 3324 ? no support
G08 A06 Scaffold000221 Scaffold000221 1 80530 ? no support
G08 A06 Scaffold000595 Scaffold000595 1 10337 ? no support
G08 A06 Scaffold000145 Scaffold000145 1 261692 - no support
G08 A06 Scaffold000003_4 Scaffold000003 10101790 11093742 + Split no support 4 BinMarkers
G08 A06 Scaffold000014_3 Scaffold000014 3506368 5718963 + Split support 13 BinMarkers
G08 A06 Scaffold000016_2 Scaffold000016 4330174 5621156 + Split support 5 BinMarkers
G08 A06 Scaffold000177 Scaffold000177 1 147072 ? support
G08 A06 Scaffold000091_2 Scaffold000091 370870 1095044 - Split no support 5 BinMarkers
G08 A06 Scaffold000609 Scaffold000609 1 10123 ? no support
G08 A06 Scaffold000136 Scaffold000136 1 324742 + no support
G08 A06 Scaffold000034_1 Scaffold000034 1 3966651 + Split support 2 BinMarkers
G08 A06 Scaffold000139 Scaffold000139 1 305217 - support
G08 A06 Scaffold000047_1 Scaffold000047 1 1339929 - Split support 10 BinMarkers
G08 A06 Scaffold000008_1 Scaffold000008 1 784055 - Split support 5 BinMarkers
G08 A06 Scaffold000035 Scaffold000035 1 3018247 + support
G08 A06 Scaffold000087_con Scaffold000087 1 1117089 - support
G08 A06 Scaffold000039 Scaffold000039 1 2792378 - support
G08 A06 Scaffold000008_3 Scaffold000008 6702625 7319999 + Split support 5 BinMarkers
G08 A06 Scaffold000023 Scaffold000023 1 4237132 - support
G08 A06 Scaffold000118 Scaffold000118 1 527563 + support
G08 A06 Scaffold000004 Scaffold000004 1 9448172 + support


G08 A06 Scaffold000052 Scaffold000052 1 2161117 + support
G08 A06 Scaffold000053 Scaffold000053 1 2145108 + support
G08 A06 Scaffold000082 Scaffold000082 1 1179777 - support
G09 A08 Scaffold000140 Scaffold000140 1 286569 - support
G09 A08 Scaffold000007_3 Scaffold000007 5817462 7741191 ? Split support 1 BinMarkers
G09 A08 Scaffold000069_1 Scaffold000069 1 994093 - Split support 8 BinMarkers
G09 A08 Scaffold000022_2 Scaffold000022 1 3811410 ? Split support 11 BinMarkers
G09 A08 Scaffold000005_2 Scaffold000005 1 1499896 ? Split support 4 BinMarkers
G09 A08 Scaffold000098 Scaffold000098 1 827123 + support
G09 A08 Scaffold000046 Scaffold000046 1 2618469 - support
G09 A08 Scaffold000054 Scaffold000054 1 2092286 + support
G09 A08 Scaffold000146 Scaffold000146 1 242819 ? no support
G09 A08 Scaffold000074 Scaffold000074 1 1513914 + support
G09 A08 Scaffold000026 Scaffold000026 1 3517859 - support
G09 A08 Scaffold000040_2 Scaffold000040 1017357 2812160 ? Split support 1 BinMarkers
G09 A08 Scaffold000067 Scaffold000067 1 1623910 + support
G09 A08 Scaffold000031_2 Scaffold000031 932620 3246927 - Split support 17 BinMarkers
G09 A08 Scaffold000043 Scaffold000043 1 2595475 + support
G10 A04 Scaffold000018 Scaffold000018 1 5044191 - support
G10 A04 Scaffold000014_1 Scaffold000014 1 1582907 + Split support 2 BinMarkers
G10 A04 Scaffold000032 Scaffold000032 1 3167877 - support
G10 A04 Scaffold000099 Scaffold000099 1 838119 + support
G10 A04 Scaffold000040_1 Scaffold000040 1 1017356 + Split support 2 BinMarkers
G10 A04 Scaffold000028 Scaffold000028 1 5022525 + support
G10 A04 Scaffold000049 Scaffold000049 1 2365081 - support
G10 A04 Scaffold000051_2 Scaffold000051 194543 2129242 + Split support 12 BinMarkers


G10 A04 Scaffold000090 Scaffold000090 1 975164 - support
G10 A04 Scaffold000161 Scaffold000161 1 145531 ? support
G10 A04 Scaffold000111 Scaffold000111 1 588963 + support
G10 A04 Scaffold000048_2 Scaffold000048 2314490 2394815 ? Split support 1 BinMarkers
G10 A04 Scaffold000117 Scaffold000117 1 525733 + support
G10 A04 Scaffold000176 Scaffold000176 1 114162 ? support
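For readers who want to work with Supplemental Table 6 programmatically, the following is a minimal parsing sketch. It assumes that each table row has been joined onto a single whitespace-separated line, that Start and End are 1-based inclusive coordinates (as the adjacent split segments suggest), and that "support"/"no support" belongs to the Z16XL144DH linkage map while "N BinMarkers" belongs to the RILs linkage map. The field names and both function names are hypothetical and do not come from the source.

```python
import re
from collections import defaultdict

# Hypothetical pattern for one whitespace-separated row of Supplemental Table 6.
# Field names are illustrative; they are not defined in the source.
ROW = re.compile(
    r"^(?P<group>G\d+)\s+(?P<chrom>A\d+)\s+(?P<piece>\S+)\s+(?P<scaffold>\S+)\s+"
    r"(?P<start>\d+)\s+(?P<end>\d+)\s+(?P<orient>[+?-])"
    r"(?:\s+(?P<split>Split))?"
    r"(?:\s+(?P<dh_map>no support|support))?"
    r"(?:\s+(?P<bin_markers>\d+)\s+BinMarkers)?\s*$"
)


def parse_row(line):
    """Parse one table row into a dict; raise ValueError if it does not match."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised row: {line!r}")
    rec = match.groupdict()
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    # Assumption: coordinates are 1-based and inclusive.
    rec["length"] = rec["end"] - rec["start"] + 1
    rec["split"] = rec["split"] is not None
    if rec["bin_markers"] is not None:
        rec["bin_markers"] = int(rec["bin_markers"])
    return rec


def anchored_length_by_chromosome(lines):
    """Sum segment lengths per chromosome over the given rows."""
    totals = defaultdict(int)
    for line in lines:
        rec = parse_row(line)
        totals[rec["chrom"]] += rec["length"]
    return dict(totals)
```

Applied to all rows of the table, `anchored_length_by_chromosome` would give the total anchored sequence per chromosome under the assumptions stated above.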


Supplemental Table 7. Comparisons of statistics on genomic annotations between B. rapa genome V1.5 and V2.0

Genome assemVer 1.5 assemVer 2.0
Size (bp) 283,810,373 389,189,875
GC Content 35.26% 36.17%

Genes annotVer 1.5 annotVer 2.0
Number of Genes 41,174 48,826
Number of Genes on Plus Strand 20,608 24,684
Number of Genes on Minus Strand 20,566 24,142
Multi-exon Genes 32,240 36,850
Mean Gene Length (bp) 2,015 1,908
Gene density (Kb/gene) 6.9 8.0
Number of Transcripts 41,174 55,959
Percent of Transcripts with Introns 78.30% 78.04%
Mean Transcript Length (bp) 2,015 2,348
Mean CDS Length 1,172 1,100
Percent Coding 17.00% 13.80%

Exons annotVer 1.5 annotVer 2.0
Number 206,990 237,462
Mean Number per Transcript 5.03 4.86
Mean Length (bp) 233 226
Total Length (bp) 48,237,786 53,705,886

Introns annotVer 1.5 annotVer 2.0
Number 165,816 188,636
Mean Number per Transcript 4.03 3.86
Mean Length (bp) 209 209
Total Length (bp) 34,732,551 39,443,842

UTRs annotVer 1.5 annotVer 2.0
Number of Genes Having UTRs NA 29,423
Mean UTR Length (bp) NA 230.23
Number of 5′ UTRs NA 40,854
Mean 5′ UTR Length (bp) NA 178.95
Number of 3′ UTRs NA 35,836
Mean 3′ UTR Length (bp) NA 288.69
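As a worked check on two derived rows of Supplemental Table 7, the sketch below recomputes gene density as genome size divided by gene number, and approximates the coding percentage as mean CDS length times gene number divided by genome size. The second formula is an assumption about how "Percent Coding" was obtained; it happens to agree with the reported values to one decimal place. The function and variable names are illustrative only.

```python
def gene_density_kb(genome_size_bp, n_genes):
    """Average genome span per gene, in kb."""
    return genome_size_bp / n_genes / 1000


def percent_coding(mean_cds_bp, n_genes, genome_size_bp):
    """Approximate coding percentage, assuming one CDS of mean length per gene."""
    return 100 * mean_cds_bp * n_genes / genome_size_bp


# Values copied from Supplemental Table 7.
V15 = {"genome": 283_810_373, "genes": 41_174, "mean_cds": 1_172}
V20 = {"genome": 389_189_875, "genes": 48_826, "mean_cds": 1_100}

density_v15 = gene_density_kb(V15["genome"], V15["genes"])
density_v20 = gene_density_kb(V20["genome"], V20["genes"])
coding_v15 = percent_coding(V15["mean_cds"], V15["genes"], V15["genome"])
coding_v20 = percent_coding(V20["mean_cds"], V20["genes"], V20["genome"])
```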


Supplementary Table 8. Classification of tandem gene arrays in the two genome versions based on their syntenic relationship.

Item V1.5 V2.0
Tandem-to-Tandem (a) 1,458 1,517
Tandem-to-NonTandem (b) 453 1,636
Tandem-to-NonSynteny (c) 253 372
Total 2,164 3,525

a: counterparts of tandem arrays in V1.5 (or V2.0) are also tandem arrays in V2.0 (or V1.5)

b: tandem arrays in V1.5 (or V2.0) correspond to non-tandem arrays in V2.0 (or V1.5)

c: tandem arrays in V1.5 (or V2.0) have no syntenic counterparts in V2.0 (or V1.5)


Supplementary Table 9. The enriched GO terms among tandem arrays in B. rapa genome V2.0. (excel table)


Supplemental Table 10. Comparisons of the two annotations based on their syntenic relationship.

Item V1.5 V2.0

Total genes 41,174 48,826

Tandem (#arrays|#genes) 2,164|5,228 3,525|7,977

Tandem Redundancy Removed 38,110 44,374

Non-syntenic Genes 3,834 8,630

Syntenic Genes 34,276 35,744

One-to-one Genes 29,294 29,294

Multiple-to-one Genes 2,235 1,076

One-to-multiple Genes 2,327 4,972
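The "Tandem Redundancy Removed" row of Supplemental Table 10 can be reproduced from the rows above it if each tandem array is assumed to be collapsed to a single representative gene. That collapsing rule is an inference from the numbers rather than something the table states, and the function name below is hypothetical.

```python
def tandem_redundancy_removed(total_genes, tandem_arrays, tandem_genes):
    """Gene count after keeping one representative per tandem array (assumed rule)."""
    return total_genes - (tandem_genes - tandem_arrays)


# Values copied from Supplemental Table 10.
v15 = tandem_redundancy_removed(41_174, 2_164, 5_228)
v20 = tandem_redundancy_removed(48_826, 3_525, 7_977)

# Non-syntenic plus syntenic genes should add up to the same totals.
v15_partition = 3_834 + 34_276
v20_partition = 8_630 + 35_744
```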


Supplementary Table 11. Statistics of mRNA-Seq data.

Tissue Total Raw Data (Gb)

Above-ground Stem 3.15

Flower 4.06

Small Anther 3.77

Middle Anther 3.32

Tender Leaf 4.7

Middle Leaf 3.32

Petiole 3.48

Seed Pod 3.26


Supplementary Table 12. Assessment of the V1.5 and V2.0 gene sets in BUSCO notation.

Version Gene number BUSCO notation assessment results

V1.5 41,174 C:93.7% [D:67%], F:2.1%, M:4.2%, n:429*

V2.0 48,826 C:93% [D:67%], F:2.8%, M:4.2%, n:429

* C:complete [D:duplicated], F:fragmented, M:missing, n:gene number.
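The BUSCO notation strings in Supplemental Table 12 can be unpacked with a short parser, sketched below. The function name `parse_busco` is hypothetical, and the pattern only covers the C/D/F/M/n fields that appear in this table. A useful sanity check is that complete, fragmented and missing percentages sum to 100.

```python
import re


def parse_busco(summary):
    """Extract C, D, F, M percentages and n from a BUSCO notation string."""
    scores = {
        key: float(value)
        for key, value in re.findall(r"([CDFM]):(\d+(?:\.\d+)?)%", summary)
    }
    n_match = re.search(r"n:(\d+)", summary)
    if n_match is not None:
        scores["n"] = int(n_match.group(1))
    return scores


# Strings copied from Supplemental Table 12.
v15 = parse_busco("C:93.7% [D:67%], F:2.1%, M:4.2%, n:429")
v20 = parse_busco("C:93% [D:67%], F:2.8%, M:4.2%, n:429")
```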


Supplemental Table 13. Statistics of non-coding RNAs.

Type Subtype Copy Average Length (bp) Total Length (bp) % Genome

miRNA 1,295 211 273,763 0.070

tRNA 1,391 75 104,488 0.027

rRNA 1,730 299 517,909 0.133

18S 677 499 337,911 0.087

28S 573 120 68,828 0.018

5.8S 244 335 81,737 0.021

5S 236 125 29,433 0.007

snRNA 3,511 84 295,668 0.076

CD-box 3,201 79 253,329 0.065

HACA-box 130 120 15,639 0.004

splicing 180 148 26,700 0.007
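The "% Genome" column of Supplemental Table 13 appears to be the total length of each class divided by the V2.0 assembly size reported in Supplemental Table 7 (389,189,875 bp). The sketch below recomputes it under that assumption and also checks that the subtype rows add up to their parent rows. The names used are illustrative.

```python
GENOME_SIZE_BP = 389_189_875  # V2.0 assembly size from Supplemental Table 7


def percent_of_genome(total_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Share of the assembly covered by a feature class, in percent."""
    return 100 * total_length_bp / genome_size_bp


# Total lengths (bp) copied from Supplemental Table 13.
totals = {"miRNA": 273_763, "tRNA": 104_488, "rRNA": 517_909, "snRNA": 295_668}
percents = {name: round(percent_of_genome(bp), 3) for name, bp in totals.items()}

# Subtype rows: (copy number, total length) pairs.
rrna_subtypes = [(677, 337_911), (573, 68_828), (244, 81_737), (236, 29_433)]
snrna_subtypes = [(3_201, 253_329), (130, 15_639), (180, 26_700)]
```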


Supplementary Table 14. Detailed sub-genome information of B. rapa genome V2.0. (excel table)


Supplementary Table 15. Dominant gene expression between sub-genomes LF and MFs in B. rapa genome V2.0.

Organ LF* MF1* MF2* Not expressed Binomial test (LF & MFs)
leaf 300 178 154 87 2.61E-13
stem 280 183 154 71 6.29E-10
root 264 197 165 48 4.40E-06

*: #2-fold changes, i.e. the number of genes expressed at least two-fold higher than both of the other two syntenic genes.
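The table does not spell out the null hypothesis behind the binomial test, so the following is only one plausible formulation and is not expected to reproduce the reported p-values exactly. It assumes that, among triplets with a dominantly expressed copy, each sub-genome is equally likely (probability 1/3) to carry the dominant copy, and it computes the one-sided probability of seeing at least the observed LF count. Function names are hypothetical; only the Python standard library is used.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))


# Leaf counts from Supplemental Table 15: LF, MF1, MF2.
lf, mf1, mf2 = 300, 178, 154
n_dominant = lf + mf1 + mf2

# Assumed null: the dominant copy falls in LF with probability 1/3.
p_leaf = binom_upper_tail(lf, n_dominant, 1 / 3)
```

Under this assumed null the leaf data give a very small tail probability, which is at least qualitatively in line with the strong LF bias reported in the table.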


Supplementary Table 16. Sequence homology between B. rapa version-specific genes and other Brassicaceae species.

Species V1.5 hits V1.5 hits (identity>=70%, coverage>=70%) V2.0 hits V2.0 hits (identity>=70%, coverage>=70%)

A. thaliana 2,278 972 4,631 1,274

A. lyrata 2,300 945 4,459 1,257

B. oleracea 3,194 1,693 6,512 2,541

C. rubella 2,223 883 4,602 1,226

T. parvula 2,293 976 4,910 1,345

Total 3,244 1,758 6,687 2,642


Supplementary Table 17. Enriched GO terms of over-retained genes in B. rapa genome V2.0. (excel table)


Supplementary Table 18. Enriched GO terms of under-retained genes in B. rapa genome V2.0. (excel table)


Supplemental Figures

Supplemental Figure 1. An example of integrating information from linkage map and synteny map between B. rapa and (A) A. thaliana or (B) S. parvula to determine the order of scaffolds in local regions of chromosomes (A04). Red plots represent the genetic map of the RILs population, and green plots represent the genetic map of the F1DH population.


Supplemental Figure 2. Scatterplot showing the correspondence between physical position and genetic distance of RILs (blue) and F1DH (red) populations in ten chromosomes of B. rapa.


Supplemental Figure 3. Distribution of total tandem genes and tandem arrays along chromosomes in B. rapa genome V1.5 and V2.0. A 6-Mb sliding window with a 1-Mb step was applied to screen tandem genes and tandem arrays across ten chromosomes. The y-axis denotes the number of tandem genes and tandem arrays in each window.
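The sliding-window counting described in this caption can be sketched as follows. The window (6 Mb) and step (1 Mb) come from the caption; the half-open window convention, the truncation of the last window at the chromosome end, and the function name are assumptions made for illustration.

```python
def sliding_window_counts(positions, chrom_length, window=6_000_000, step=1_000_000):
    """Count features per window; returns (start, end, count) tuples.

    Windows are half-open [start, end) and the final window is truncated at the
    chromosome end (an assumed convention, not taken from the source).
    """
    counts = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        n_features = sum(1 for pos in positions if start <= pos < end)
        counts.append((start, end, n_features))
        if end == chrom_length:
            break
        start += step
    return counts


# Toy example: four tandem-gene positions on a hypothetical 10-Mb chromosome.
example = sliding_window_counts(
    [500_000, 1_500_000, 6_500_000, 9_000_000], chrom_length=10_000_000
)
```

Plotting the third element of each tuple against the window start would give a curve of the kind the caption describes.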


Supplemental Figure 4. Comparisons of gene annotations in the two assemblies of the B. rapa genome. (A) Classification of corresponding gene-pairs in the two gene annotations. (B) Illustration of the classification of gene-pairs between the two assemblies, namely one-to-one gene-pairs, one-to-multiple gene arrays, and multiple-to-one gene-pairs. (C) Examples of sequence alignments of one-to-one gene-pairs showing the variations between the two gene sets. The first shows only a point mutation between the two protein sequences; the second shows a gene annotated with more coding sequence in V2.0 than in V1.5; and the third shows a gene with more coding sequence in V1.5 than in V2.0. The fourth example shows a combination of the cases given above.


Supplemental Figure 5. Comparisons of TEs between B. rapa genome V1.5 and V2.0. (A) Frequency distributions of TE subgroups in the two assemblies of the B. rapa genome. (B) Number and frequency distributions of TEs of different lengths in the two assemblies. The x-axis stands for the lengths of TEs, and the y-axis denotes the TE counts (left) and percentage (right).


Supplemental Figure 6. Distribution of TEs along ten chromosomes in B. rapa genome V1.5 and V2.0. A 500-kb sliding window with a 100-kb step was applied to screen TEs across the ten chromosomes. The y-axis denotes the ratio of TE sequences in each window.


Supplemental Figure 7. Examples showing that more TEs were annotated in V2.0 than in V1.5 in corresponding intergenic regions of the two genome assemblies of B. rapa. Blue triangles represent TEs located in the intergenic regions. Red arrows link corresponding gene pairs that show syntenic relationships. (A) A new TE in V2.0 shows limited similarity (coverage=71.9%) to the intergenic region between genes Bra000493 and Bra000494. (B) Three more TEs were annotated between BraA04001951 and BraA04001952 than between Bra034310 and Bra034311.


Supplemental Figure 8. The density of orthologous genes in three sub-genomes (LF, MF1 and MF2) of B. rapa genome V2.0 compared to A. thaliana. The x-axis denotes the seven ancestral chromosomes of the Brassica ancestral genome. The y-axis denotes the percentage of retained orthologous genes in B. rapa sub-genomes around each A. thaliana gene, with a total window size of 1,001 genes (500 genes flanking each side of a given gene). Part of the D block was lost in sub-genome LF, resulting in the decreased gene density at the head of tPCK7.
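The gene-indexed window in this caption (1,001 genes: the focal gene plus 500 on each side) can be sketched as below. The input is assumed to be a list of booleans, one per A. thaliana gene in ancestral order, marking whether an ortholog is retained in a given sub-genome. Truncating the window at chromosome ends is an assumption, as is the function name.

```python
from itertools import accumulate


def retention_percent(retained, flank=500):
    """Percentage of retained genes in a window of 2*flank + 1 genes around each gene.

    Windows are truncated at the two ends of the list (assumed edge handling).
    """
    # prefix[i] holds the number of retained genes among the first i genes.
    prefix = [0] + list(accumulate(int(flag) for flag in retained))
    n_genes = len(retained)
    percents = []
    for i in range(n_genes):
        lo = max(0, i - flank)
        hi = min(n_genes, i + flank + 1)
        percents.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return percents


# Toy example with a flank of 1 gene (window of 3 genes).
toy = retention_percent([True, False, True, True, False], flank=1)
```

The prefix-sum form keeps the computation linear in the number of genes, which matters little here but avoids recounting 1,001 genes for every position.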


Supplemental Figure 9. RNA+ TEs show a negative association with the expression levels of gene doublets between sub-genomes LF and MF1, and between LF and MF2. (A) Dominantly expressed genes from sub-genome LF show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. dLF denotes dominantly expressed genes from LF. (B) Dominantly expressed genes from sub-genome MF1 show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. dMF1 denotes dominantly expressed genes from MF1. 1,046 and 665 gene doublets of dLF-MF1 and LF-dMF1 were considered, respectively. (C) Dominantly expressed genes from sub-genome LF show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. (D) Dominantly expressed genes from sub-genome MF2 show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. dMF2 denotes dominantly expressed genes from MF2. 878 and 533 gene doublets of dLF-MF2 and LF-dMF2 were considered, respectively.


Supplemental References

Birney, E., Clamp, M., and Durbin, R. (2004). GeneWise and genomewise. Genome research 14:988-995.

Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27:578-579.

Cheng, F., Sun, C., Wu, J., Schnable, J., Woodhouse, M.R., Liang, J., Cai, C., Freeling, M., and Wang, X. (2016). Epigenetic regulation of subgenome dominance following whole genome triplication in Brassica rapa. The New phytologist 211:288-299.

Cheng, F., Wu, J., Fang, L., and Wang, X. (2012). Syntenic gene analysis between Brassica rapa and other Brassicaceae species. The Brassica Genome:5.

English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., and Worley, K.C. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PloS one 7:e47768.

Feng, C., Jian, W., Lu, F., Silong, S., Bo, L., Ke, L., Guusje, B., and Xiaowu, W. (2012). Biased Gene Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa. PloS one 7:e36442.

Freeling, M. (2009). Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual review of plant biology 60:433-453.

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., and Zeng, Q. (2011). Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology 29:644.

Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., and Bateman, A. (2005). Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research 33:D121-D124.

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., and Town, C.D. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research 31:5654-5666.

Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., and Wortman, J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9:R7.

Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., and Duquenne, L. (2009). InterPro: the integrative protein signature database. Nucleic acids research 37:D211-D215.

Larkin, M.A., Blackshields, G., Brown, N., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., and Lopez, R. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.

Larose, D.T. (2005). k-Nearest Neighbor Algorithm. Discovering Knowledge in Data: An Introduction to Data Mining:90-106.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.

Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., and Wang, J. (2009a). SNP detection for massively parallel whole-genome resequencing. Genome research 19:1124-1132.

Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966-1967.


Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25:955-964.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., and Liu, Y. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:1-6.

Parra, G., Bradnam, K., and Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23:1061-1067.

Salmela, L., and Rivals, E. (2014). LoRDEC: accurate and efficient long read error correction. Bioinformatics:btu538.

She, R., Chu, J.S.-C., Wang, K., Pei, J., and Chen, N. (2009). GenBlastA: enabling BLAST to identify homologous gene sequences. Genome research 19:143-149.

Simao, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210-3212.

Sun, C., Wu, J., Liang, J., Schnable, J.C., Yang, W., Cheng, F., and Wang, X. (2015). Impacts of Whole-Genome Triplication on MIRNA Evolution in Brassica rapa. Genome biology and evolution 7:3085-3096.

Sun, X., Liu, D., Zhang, X., Li, W., Liu, H., Hong, W., Jiang, C., Guan, N., Ma, C., and Zeng, H. (2013). SLAF-seq: an efficient method of large-scale de novo SNP discovery and genotyping using high-throughput sequencing. PLoS One 8:e58700.

Tarailo-Graovac, M., and Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics:4.10.1-4.10.14.

Van Ooijen, J. (2006). JoinMap® 4, Software for the calculation of genetic linkage maps in experimental populations. Kyazma BV, Wageningen 33:10.1371.

Wang, F., Li, L., Li, H., Liu, L., Zhang, Y., Gao, J., and Wang, X. (2012). Transcriptome analysis of rosette and folding leaves in Chinese cabbage using high-throughput RNA sequencing. Genomics 99:299-307.

Wang, X., and Cheng, F. (2016). Epigenetic regulation of subgenome dominance following whole genome triplication in Brassica rapa. New Phytologist:1.

Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.-H., Bancroft, I., and Cheng, F. (2011). The genome of the mesopolyploid crop species Brassica rapa. Nature genetics 43:1035-1039.

Xu, Z., and Wang, H. (2007). LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35:W265-W268.

Yu, X., Wang, H., Zhong, W., Bai, J., Liu, P., and He, Y. (2013). QTL mapping of leafy heads by genome resequencing in the RIL population of Brassica rapa. PloS one 8:e76059.