
Supplemental Information

Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene re-annotation

Chengcheng Cai¹, Xiaobo Wang¹, Bo Liu, Jian Wu, Jianli Liang, Yinan Cui, Feng Cheng*, and Xiaowu Wang*

Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Zhongguancun Nandajie No.12, Haidian district, Beijing 100081, P.R. China.

¹These authors contributed equally to this article.

*Corresponding authors.

Contents

Supplemental Methods
Supplemental Tables
Supplemental Figures
Supplemental References

Supplemental Methods

Genome de novo assembly

A total of 55 Gb (114×) of paired-end reads, 150 bp in length, were generated from two libraries with an insert size of 250 bp. A further 8.7 Gb (18×) of reads were generated from two mate-pair libraries with insert sizes of 20 Kb and 40 Kb, respectively. All of these data were produced on the Illumina HiSeq 2500 sequencing platform. In addition, 6.5 Gb (13.4×) of single-molecule sequencing data (PacBio reads) with an average length of 12 Kb were generated. Besides these newly generated data, all datasets used to assemble B. rapa genome V1.5 (Wang et al., 2011) were also incorporated into this genome re-assembly.
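The depth values quoted above are the sequenced bases divided by the genome size. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes a genome size of about 485 Mb: this value is not stated in the text and is only inferred here from the clean-data and depth columns of Supplemental Table 1. The function and constant names are hypothetical.

```python
# Assumption: ~485 Mb genome size, inferred from Supplemental Table 1
# (clean data / reported depth); it is not stated in the text.
ASSUMED_GENOME_SIZE_BP = 485_000_000


def sequencing_depth(total_bases: int,
                     genome_size_bp: int = ASSUMED_GENOME_SIZE_BP) -> float:
    """Return the average fold coverage of a dataset."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


# Clean-data totals copied from Supplemental Table 1.
paired_end_250bp = 55_479_375_000
mate_pair_20kb = 5_638_004_562
mate_pair_40kb = 3_105_338_220

depth_pe = sequencing_depth(paired_end_250bp)
depth_mp = sequencing_depth(mate_pair_20kb + mate_pair_40kb)
print(f"paired-end: {depth_pe:.2f}x, mate-pair: {depth_mp:.2f}x")
```

Under this assumed genome size, the two values agree with the 114× and 18× figures given in the text.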

Low-quality Illumina reads were removed if they met any of the following criteria: 1) the read contained “N”; 2) the average Phred-like score of the read was below 20; 3) the read was shorter than 50 bp after trimming the 3′-terminal nucleotides whose Phred-like score was below 13. Duplicated reads generated from the same amplicon were collapsed to a single copy. Sequencing errors in the PacBio reads were corrected with the 150 bp paired-end Illumina reads using LoRDEC (Salmela and Rivals, 2014) with parameters “-k 19 -s 3”. The resulting clean reads were then used for genome assembly.
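The three Illumina filtering rules can be sketched as a single function. This is an illustrative sketch only, not the pipeline that was actually run. The function names are hypothetical, and the order of the checks (mean quality computed on the untrimmed read, then 3′ trimming) is an assumption, since the text does not specify it.

```python
from typing import List, Optional


def trim_3prime(seq: str, quals: List[int], min_q: int = 13) -> str:
    """Remove 3'-terminal bases while their Phred-like score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


def filter_read(seq: str, quals: List[int], min_mean_q: float = 20.0,
                min_trim_q: int = 13, min_len: int = 50) -> Optional[str]:
    """Return the trimmed read, or None if one of the three rules rejects it."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    # Rule 1: reject reads containing N (empty reads are rejected as well).
    if not seq or "N" in seq.upper():
        return None
    # Rule 2: reject reads whose average score is below 20.
    # Assumption: the average is taken over the untrimmed read.
    if sum(quals) / len(quals) < min_mean_q:
        return None
    # Rule 3: trim low-quality 3' bases, then reject reads shorter than 50 bp.
    trimmed = trim_3prime(seq, quals, min_trim_q)
    if len(trimmed) < min_len:
        return None
    return trimmed
```

A read of 60 bases whose last five bases score below 13 would, for example, be returned as a 55-base read, whereas the same read with fifteen low-quality 3′ bases would be discarded.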

Genome assembly proceeded in several steps. First, all the 150 bp paired-end reads (114×) were assembled into contigs with SOAPdenovo2 (Luo et al., 2012) using a k-mer size of 91 bp. Next, the large insert-size mate-pair libraries (>2 Kb) used in the previous assembly, together with the two newly generated libraries (20 Kb and 40 Kb), were used to link contigs into scaffolds with SSPACE (Boetzer et al., 2011). Publicly available BAC-end sequences were then mapped to the scaffolds using BLAST, and the scaffolds were further linked into super-scaffolds according to the pairing relationships of the BAC-end sequences. Finally, all the paired-end Illumina reads from short insert-size libraries (<1 Kb) were used to close gaps in these super-scaffolds using GapCloser (Luo et al., 2012) with default parameters. The single-molecule PacBio reads were then used to close further gaps in the super-scaffolds using PBjelly_V15.2.20 (English et al., 2012) with parameters “--minGap 1 -minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500 -nproc 5 -noSplitSubreads”.

The quality of genome assemblies V1.5 and V2.0 was validated by CEGMA analysis (Parra et al., 2007). The 458 core eukaryotic genes (CEG database) were BLASTed against each assembly, yielding hits for 455 (99.34%) and 454 (99.13%) of the 458 CEG proteins in V1.5 and V2.0, respectively. The assemblies were also validated by matching B. rapa ESTs downloaded from NCBI, which showed that 99.16% and 99.34% of the ESTs were supported by assemblies V1.5 and V2.0, respectively.

Construction of pseudo-chromosomes

To correct mis-assembled scaffolds and assign the corrected scaffolds to the ten chromosomes of B. rapa, two genetic maps were constructed: one for a previously reported RILs (recombinant inbred lines) population (Yu et al., 2013) and one for an F1DH (F1 doubled haploid) population of B. rapa. The RILs population contains 150 recombinant inbred lines derived from a cross between a heading Chinese cabbage (B. rapa ssp. pekinensis cv Bre) and a non-heading B. rapa (B. rapa ssp. chinensis cv Wut) (Yu et al., 2013). Low-depth sequencing data of the RILs population were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). The F1DH population comprised 120 F1DH lines derived from a cross between a Chinese cabbage DH line Z16 (B. rapa ssp. pekinensis) and a rapid cycling line L144 (B. rapa ssp. oleifera). The F1DH population was previously used to develop a genetic map with InDel markers to construct pseudo-chromosomes for B. rapa genome V1.5 (Wang et al., 2011). In this work, SLAF-Seq (Sun et al., 2013) was performed on the 120 F1DH lines and their two parents.

Two high-density linkage maps were constructed using the following procedure. First, raw data were filtered to remove low-quality reads according to the rules described above. Next, clean reads of the two parents of a population were aligned to the B. rapa reference genome (V2.0) using SOAPaligner (SOAP2) (Li et al., 2009b) with parameters “-m 100 -x 1000 -r 0”, and paired-end reads with unique hits to the reference were retained. SOAPsnp (Li et al., 2009a) was then used to call SNPs with parameter “-L 100”. Confident SNPs were selected using the following criteria: 1) an average quality score above 20; 2) coverage by at least three reads; 3) a homozygous genotype. SNP loci that were polymorphic between the two parents were kept for further analysis. Reads from each line of the population were aligned to B. rapa genome V2.0 using SOAP2 (Li et al., 2009b) with the same parameters, and genotypes were identified for each line at the SNP loci that were polymorphic between the two parents. Ungenotyped loci were imputed using the k-nearest neighbors (k-NN) algorithm (Larose, 2005). The parental origin of each SNP locus was then determined for each line of the population, and genomic regions showing complete linkage disequilibrium across the whole population were merged to form bin-markers. Finally, these bin-markers were submitted to JoinMap (version 4.0) (Van Ooijen, 2006) to construct the genetic map. This process was applied to both the RILs and F1DH populations to build the two genetic maps.
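The bin-marker step can be illustrated with a short sketch. It assumes that "complete linkage disequilibrium" means that consecutive SNPs carry an identical parental genotype pattern across all lines, which is an interpretation rather than something the text states. The function name, the data layout, and the single-letter genotype coding ('A' and 'B' for the two parents) are hypothetical.

```python
from typing import List, Tuple

Snp = Tuple[int, str]             # (position, one genotype character per line)
BinMarker = Tuple[int, int, str]  # (start, end, shared genotype pattern)


def build_bin_markers(snps: List[Snp]) -> List[BinMarker]:
    """Merge consecutive SNPs with identical genotype patterns into bins.

    `snps` must be sorted by position along one scaffold. Two neighbouring
    SNPs fall into the same bin when no line shows a recombination between
    them, i.e. their genotype strings are identical.
    """
    bins: List[BinMarker] = []
    for position, genotypes in snps:
        if bins and bins[-1][2] == genotypes:
            start, _, _ = bins[-1]
            bins[-1] = (start, position, genotypes)  # extend the current bin
        else:
            bins.append((position, position, genotypes))  # open a new bin
    return bins


# Toy example with three lines; 'A' and 'B' denote the two parental alleles.
example = [(100, "AAB"), (250, "AAB"), (900, "ABB"), (1200, "ABB"), (1500, "AAB")]
print(build_bin_markers(example))
```

In this toy input the five SNPs collapse into three bins, because a pattern change between positions 250 and 900, and again between 1200 and 1500, marks a recombination in at least one line.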

In total, 28 mis-assembled scaffolds were corrected using information from the two genetic maps. These 28 scaffolds were split, and their fragments were assigned to chromosomes in the same way as the other scaffolds. The RILs genetic map served as the principal evidence for ordering the assembled scaffolds because of its higher marker density from resequencing. The genetic map built from SLAF-Seq of the F1DH population was used to assist the ordering and orientation of scaffolds where there was limited or no recombination information in the RILs map. The physical syntenic relationships between B. rapa and A. thaliana or S. parvula were also used as evidence in determining the order and orientation of scaffolds in local regions.

Constructing transcripts from mRNA-Seq data

A total of 29 Gb of mRNA-Seq data were generated from eight tissues of B. rapa: above-ground stem, flower, small anther, middle anther, tender leaf, middle leaf, petiole, and seed pod. Detailed information on each tissue is listed in Supplemental Table 11. In addition to the newly generated data, two previously reported mRNA-Seq datasets were incorporated. The first dataset was used to verify gene models in the first release of the B. rapa genome (Wang et al., 2011). The second dataset was used to study gene expression during leaf development of Chinese cabbage (Wang et al., 2012).

Both de novo and genome-guided approaches were used to assemble these mRNA-Seq reads into transcripts. For the de novo approach, the Trinity tool package (Grabherr et al., 2011) was used with default parameters. For the genome-guided approach, several utilities distributed with Trinity (alignReads.pl, prep_rnaseq_alignments_for_genome_assisted_assembly.pl, GG_write_trinity_cmds.pl, ParaFly and GG_trinity_accession_incrementer.pl) were used to complete the alignment and assembly processes step by step.

Genome annotation

Gene prediction consisted of the following steps: 1) ab initio gene modeling; 2) detection of homologous genes; 3) mapping of transcript fragments; and 4) merging of the three predictions. Repeat sequences were masked before gene annotation. In the first step, two gene predictors, Augustus and GlimmerHMM, were used for de novo gene prediction. From the output of both tools, predicted genes whose coding regions were shorter than 150 bp were removed. In the second step, protein sequences of A. thaliana (TAIR10) and C. rubella were collected. These protein sequences were aligned to the masked B. rapa genome using genBlastA (She et al., 2009) with parameters “-e 1e-5”. The candidate gene sequences reported by genBlastA, together with their homologous proteins, were further processed by GeneWise (Birney et al., 2004) with default parameters to predict gene structures. In the third step, three kinds of evidence were used for transcript-assisted gene prediction: 1) a set of 214,482 B. rapa ESTs downloaded from NCBI; 2) de novo assembled transcripts; and 3) genome-guided assembled transcripts. These transcript sequences were aligned to the masked genome, and the results were processed by PASA (Haas et al., 2003) with default parameters.

In the final step, EVidenceModeler (EVM) (Haas et al., 2008) was used to merge the results from de novo prediction, homologous prediction, and PASA annotation into a weighted consensus gene set (weights: 1 for de novo prediction, 5 for homologous prediction, and 10 for PASA annotation). The EVM results were then filtered to remove: 1) genes with protein-coding regions shorter than 150 bp; 2) incomplete genes without start and stop codons; and 3) genes with premature termination of translation. Finally, PASA was used to improve the EVM gene models by modifying exons, adding UTRs, and determining alternatively spliced isoforms. The V1.5 and V2.0 gene sets were further assessed and compared using BUSCO analysis (Simao et al., 2015) (Supplemental Table 12).
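The three post-EVM filters can be expressed as a check on a coding sequence. This is a hedged sketch rather than the filter that was actually applied: the function name is hypothetical, the standard stop codons TAA/TAG/TGA and the ATG start codon are assumed, a model is treated as incomplete if either the start or the stop codon is missing, and a CDS whose length is not a multiple of three is rejected (the last two points are assumptions that the text does not spell out).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard nuclear genetic code


def keep_gene_model(cds: str, min_len: int = 150) -> bool:
    """Return True if a CDS passes the three filters described above."""
    cds = cds.upper()
    # Filter 1: protein-coding region shorter than 150 bp.
    if len(cds) < min_len:
        return False
    # Assumption: a CDS that is not a whole number of codons is incomplete.
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Filter 2: incomplete model (assumed: start or stop codon missing).
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Filter 3: premature termination (an internal stop codon).
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    return True


complete = "ATG" + "GCT" * 50 + "TAA"                      # 156 bp, intact
premature = "ATG" + "GCT" * 10 + "TAG" + "GCT" * 40 + "TAA"
print(keep_gene_model(complete), keep_gene_model(premature))
```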

After gene model prediction, gene function annotation was performed. InterProScan (Hunter et al., 2009) was used to annotate motifs and domains by comparing the predicted genes with available databases such as PROSITE, PRINTS, Pfam, ProDom, and SMART. The GO annotation was extracted from the output of InterProScan. The predicted proteins were further aligned to the Swiss-Prot, TrEMBL, and KEGG databases using BLASTP with an E-value of 1×10⁻⁵ to obtain additional annotation information. All four annotation datasets are freely available at: http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/.

Annotation of transposable elements and non-coding RNAs

Transposable elements (TEs) were annotated using RepeatMasker (Tarailo-Graovac and Chen, 2009). To compare TE features between B. rapa genome V1.5 and V2.0, TEs were re-annotated on both genome versions using the same methods described elsewhere (Wang and Cheng, 2016). For the newly annotated TEs in V2.0, two situations were observed: 1) extra TEs in V2.0 without counterparts in the corresponding regions of V1.5 (Supplemental Figure 7A); 2) more TEs annotated in V2.0 than in V1.5 in corresponding regions (Supplemental Figure 7B). Full-length LTRs were identified using LTR_Finder with default parameters (Xu and Wang, 2007).

tRNA sequences were identified with tRNAscan-SE-1.23 (Lowe and Eddy, 1997) on the basis of the structural characteristics of tRNA. rRNAs were located by aligning known full-length plant rRNAs to the B. rapa genome. snRNA sequences were predicted using Rfam-9.1 (Griffiths-Jones et al., 2005). miRNAs were predicted using methods similar to those reported by Sun et al. (2015). In total, the annotated ncRNAs accounted for ≈0.306% of the updated genome (Supplemental Table 13).

Sub-genomes reconstruction and analysis

Based on the syntenic relationship between B. rapa genome V2.0 and A. thaliana, the least fractionated (LF), the medium fractionated (MF1) and the most fractionated (MF2) sub-genomes of B. rapa V2.0 were built. Detailed information on paralogous genes in the three sub-genomes is listed in Supplemental Table 14. The LF sub-genome retained more gene copies than the MF1 and MF2 sub-genomes. Part of block D was found to be lost, which resulted in low gene density in the LF sub-genome at the beginning of ancestral chromosome tPCK7 (Supplemental Figure 8).

A total of 2,058 genes fully retained in all three sub-genomes were identified. After removing TA (tandem) genes and MT genes (one gene in B. rapa genome V1.5 or in A. thaliana, but multiple homologous genes in V2.0), genes from 1,307 fully retained homoeologs were chosen to analyze patterns of gene expression among sub-genomes using the same methods described elsewhere (Feng et al., 2012). Genes in sub-genome LF were dominantly expressed over genes in MF1 and MF2 (Supplemental Table 15).

The relationship between TEs targeted by 24-nucleotide small RNAs and the expression pattern of gene doublets was analyzed using methods similar to those reported previously (Cheng et al., 2016). The mRNA-Seq data of three organs (root, leaf and stem) and the small RNA data for leaf used in this analysis were retrieved from http://brassicadb.org/brad/datasets/pub/srna2/. For both the LF/MF1 and LF/MF2 sub-genome pairs, small RNA-targeted TEs (RNA+ TEs) showed a negative association with the expression levels of gene doublets; that is, dominantly expressed genes from a sub-genome showed a low level of RNA+ TEs in the 2 kb of 5′ UTR regions (Supplemental Figure 9).

Comparison of gene annotation between V1.5 and V2.0

Genome synteny analysis was performed between the two versions using SynOrths (Cheng et al., 2012) to determine corresponding gene pairs (i.e. the same gene loci in the two assemblies) and tandem gene arrays. Based on the results reported by SynOrths, relationships of tandem gene arrays in the two genome versions were classified into three categories: 1) tandem arrays in one version are syntenic to tandem arrays in the other version; 2) tandem arrays in one version correspond to non-tandem arrays in the other; 3) tandem arrays in one version cannot be mapped to any gene in the other version. Non-tandem genes in the two predictions were classified into four categories: 1) one gene in V1.5 is the counterpart of one gene in V2.0 (one-to-one gene pairs); 2) one gene in V1.5 corresponds to several genes (≥2) in V2.0 (one-to-multiple genes); 3) several genes (≥2) in V1.5 correspond to one gene in V2.0 (multiple-to-one genes) (Supplemental Figure 4B); and 4) genes specific to each annotation (genes with no counterparts in the other version).
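The four non-tandem categories can be derived from a list of corresponding gene pairs. The sketch below is illustrative only: it does not reproduce SynOrths or its output format, the function name and toy gene identifiers are hypothetical, and ambiguous many-to-many relationships are simply left out of the one-to-one class (an assumption).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def classify_gene_pairs(pairs: Iterable[Tuple[str, str]],
                        v15_genes: Set[str],
                        v20_genes: Set[str]) -> Dict[str, List]:
    """Classify genes from (V1.5 gene, V2.0 gene) correspondence pairs."""
    v15_to_v20 = defaultdict(set)
    v20_to_v15 = defaultdict(set)
    for old, new in pairs:
        v15_to_v20[old].add(new)
        v20_to_v15[new].add(old)

    result: Dict[str, List] = {
        "one_to_one": [], "one_to_multiple": [], "multiple_to_one": [],
        "v15_specific": [], "v20_specific": [],
    }
    for gene in sorted(v15_genes):
        partners = v15_to_v20.get(gene, set())
        if not partners:
            result["v15_specific"].append(gene)
        elif len(partners) >= 2:
            result["one_to_multiple"].append((gene, sorted(partners)))
        else:
            partner = next(iter(partners))
            if len(v20_to_v15[partner]) == 1:
                result["one_to_one"].append((gene, partner))
    for gene in sorted(v20_genes):
        partners = v20_to_v15.get(gene, set())
        if not partners:
            result["v20_specific"].append(gene)
        elif len(partners) >= 2:
            result["multiple_to_one"].append((sorted(partners), gene))
    return result


# Toy identifiers, not real gene names.
toy_pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4"), ("a4", "b4")]
toy = classify_gene_pairs(toy_pairs, {"a1", "a2", "a3", "a4", "a5"},
                          {"b1", "b2", "b3", "b4", "b5"})
print(toy)
```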

To further analyze the differences between the two predictions, each of the above four gene categories was examined in detail. The protein sequences of one-to-one gene pairs were compared using Clustalx (Larkin et al., 2007) to assess variation in gene coding sequences. For one-to-multiple and multiple-to-one genes, mRNA-Seq reads were mapped to the CDS sequences of these genes using BWA (Li, 2013) to find evidence on whether the genes were correctly annotated. The analysis focused on read pairs in which one read mapped to one of the multiple genes while its mate mapped to another of the multiple genes; such pairs indicate that the two genes should be re-annotated as one gene. For the version-specific genes (those with no counterparts), two analyses were performed to assess their reliability: 1) mRNA-Seq reads were mapped to the CDSs of these genes using BWA; 2) their proteins were searched (BLASTP) against the proteins of five other Brassicaceae species with an E-value of 1×10⁻⁵ (Supplemental Table 16). Read mapping showed that 3,470 version-specific genes in V1.5 and 6,350 in V2.0 were supported by mRNA-Seq evidence. In addition, 3,244 genes in V1.5 and 6,687 genes in V2.0 had BLASTP hits to proteins of the other Brassicaceae species (Supplemental Table 16).
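The read-pair evidence used for one-to-multiple and multiple-to-one genes can be sketched as a count of pairs whose two mates map to different genes. This is a simplified illustration that works on already-parsed mapping results rather than on BWA output. The function names are hypothetical, and the minimum number of supporting pairs (`min_pairs`) is an assumed parameter, since the text gives no threshold.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

PairHit = Tuple[Optional[str], Optional[str]]  # gene hit by read 1, gene hit by read 2


def count_bridging_pairs(pair_hits: Iterable[PairHit]) -> Counter:
    """Count read pairs whose two mates map to two different genes."""
    counts: Counter = Counter()
    for gene1, gene2 in pair_hits:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # unmapped mate, or both mates in the same gene
        counts[tuple(sorted((gene1, gene2)))] += 1
    return counts


def merge_candidates(pair_hits: Iterable[PairHit],
                     min_pairs: int = 3) -> List[Tuple[str, str]]:
    """Gene pairs bridged by at least `min_pairs` read pairs (assumed cutoff)."""
    counts = count_bridging_pairs(pair_hits)
    return sorted(pair for pair, n in counts.items() if n >= min_pairs)


# Toy gene names for illustration only.
hits = [("g1", "g2"), ("g2", "g1"), ("g1", "g2"),
        ("g1", "g1"), ("g3", None), ("g3", "g4")]
print(merge_candidates(hits))
```

Here only the pair g1/g2 is bridged by three read pairs, so it is the only candidate for re-annotation as a single gene under the assumed cutoff.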

GO term enrichment analysis

For each GO term, its occurrences in the 3-copy gene set and in the combined 1- and 2-copy gene set were counted in V2.0. Its occurrences in the 1-copy gene set and in the combined 2- and 3-copy gene set were also counted, as reported previously (Wang et al., 2011). Fisher’s exact tests “1+2 vs 3” and “1 vs 2+3” were performed to test the over-retention of 3-copy and the under-retention of 1-copy orthologous GO terms, respectively. The results showed that genes encoding subunits of proteasomes, ribosomes, and transcription factor complexes were over retained (Supplemental Table 17), whereas genes associated with DNA repair, binding and chloroplast were under retained (Supplemental Table 18). These findings are in accordance with the gene balance hypothesis, which predicts that genes whose products interact with other gene products are more likely to be over retained, and that other genes are more likely to be under retained (Freeling, 2009). Genes associated with responses to environmental factors and plant hormones were also over retained (Supplemental Table 17).

Using a similar method, GO enrichment of tandem gene arrays in B. rapa genome V2.0 was also analyzed. The occurrences of each GO term in the tandem arrays and in non-tandem genes were counted, and Fisher’s exact test was performed to test whether the GO term was enriched in the tandem gene arrays. The results showed that genes related to defense response, membrane functions, and various kinds of enzyme activity were enriched in tandem gene arrays (Supplemental Table 9).
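For a single GO term, the enrichment test reduces to a 2×2 contingency table: genes with and without the term, inside and outside the gene set of interest. The sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution using only the standard library. It is an illustration, not the authors' implementation. The function name is hypothetical, and the one-sided (enrichment) alternative is an assumption because the text does not state which alternative was used.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    a: genes in the set of interest annotated with the GO term
    b: genes in the set of interest without the term
    c: background genes with the term
    d: background genes without the term
    Returns P(X >= a) under the hypergeometric null (enrichment).
    """
    row1 = a + b          # size of the set of interest
    col1 = a + c          # genes carrying the GO term
    total = a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(total - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / denominator


# Toy counts: 3 of 4 genes in the set carry the term, 1 of 4 outside it.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(f"one-sided p = {p_value:.4f}")
```

Testing under-retention would use the opposite tail, and in practice a multiple-testing correction across GO terms would usually be applied as well; neither is described in the text, so both are left out of this sketch.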

Accession numbers

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession AENI00000000. The version described in this paper is version AENI02000000. The genome assembly and gene annotation results are also freely available through the BRAD website (Brassica database, http://brassicadb.org/brad/datasets/pub/Genomes/Brassica_rapa/V2.0/).

Supplemental Tables

Supplemental Table 1. Summary of Illumina sequencing data used in the assembly of B. rapa genome V2.0.

Insert Size (bp)   Read Length (bp)   Raw Data (bp)     Clean Data (bp)    Sequence Depth (X)
250*               150                73,219,780,500    55,479,375,000     114.39
20,000*            44/49/90           18,903,314,720    5,638,004,562      11.62
40,000*            90                 16,801,802,820    3,105,338,220      6.4
PacBio*            /                  6,473,983,398     6,412,500,309**    13.22
200                44/75/100          19,264,162,762    13,697,925,048     28.24
500                44/75              7,807,414,874     6,407,407,132      13.21
2,000              44/75              3,581,696,974     3,322,057,634      6.85
5,000              44                 3,212,095,304     2,297,055,728      4.74
8,000              44                 2,469,275,072     1,255,661,264      2.59
10,000             44                 6,106,076,328     4,530,229,440      9.34
14,000             44                 2,066,000,464     771,807,344        1.59
BAC-end            /                  154,631,945       /                  /

*: newly generated in this project;
**: after error correction with Illumina reads.

Supplemental Table 2. Comparisons of statistics between B. rapa genome assembly V1.5 and V2.0.

       V1.5                                                 V2.0 (Scaffold Refined with PacBio)
       Contig      Number of   Scaffold    Number of        Contig      Number of   Scaffold    Number of
       Size (bp)   Contigs     Size (bp)   Scaffolds        Size (bp)   Contigs     Size (bp)   Scaffolds
N90    9,900       6,351       308,587     180              5,939       8,456       25,622      349
N80    19,011      4,395       635,699     118              19,181      5,373       975,164     89
N70    27,303      3,202       1,100,015   85               30,559      3,878       1,852,239   62
N60    36,355      2,334       1,372,560   62               41,205      2,850       2,610,373   44
N50    46,088      1,663       1,846,652   44               52,684      2,063       3,377,735   31

                           V1.5 Contigs   V1.5 Scaffolds   V2.0 Contigs   V2.0 Scaffolds
Total Size (bp)            273,100,332    283,810,373      366,413,862    389,189,875
Total Number (>=100 bp)    51,647         40,576           96,883         86,986
Total Number (>=2 kb)      9,553          821              10,673         2,178
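The N50 to N90 rows in this table follow the usual definition: the Nx size is the length of the shortest sequence in the smallest set of longest sequences that together cover x% of the assembly, and the accompanying number is the size of that set. A minimal sketch of this calculation is given below. The function name is hypothetical, and the sketch is not the script used to produce the table.

```python
from typing import Iterable, Tuple


def nx_statistic(lengths: Iterable[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx length, number of sequences needed to reach it).

    fraction=0.5 gives N50, fraction=0.9 gives N90, and so on.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return ordered[-1], len(ordered)  # only reachable through rounding effects


# Toy assembly of five sequences (total length 30).
toy_lengths = [10, 8, 6, 4, 2]
print(nx_statistic(toy_lengths, 0.5), nx_statistic(toy_lengths, 0.9))
```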

Supplemental Table 3. Comparisons of statistics on B. rapa genome assembly V2.0 before and after using PacBio data.

                         Contig (before using       Scaffold (before using     Scaffold Refined
                         PacBio Data)               PacBio Data)               with PacBio Data
                         Size (bp)     Number       Size (bp)     Number       Size (bp)     Number
N90                      2,566         14,739       40,173        354          25,622        349
N80                      10,065        8,939        784,999       97           975,164       89
N70                      16,336        6,370        1,636,670     66           1,852,239     62
N60                      22,619        4,640        2,289,139     47           2,610,373     44
N50                      29,362        3,348        2,973,276     33           3,377,735     31

                         Contig (before)   Scaffold (before)   Scaffold Refined
Total Size (bp)          333,541,035       370,219,313         389,189,875
Total Number (>100 bp)   104,563           87,323              86,986
Total Number (>2 kb)     15,785            1,697               2,178
Gaps (bp)                /                 36,678,278          22,776,013

Supplemental Table 4. Statistics of two genetic maps and assignment of assembly V2.0 to ten chromosomes of B. rapa with the two maps.

              RILs                                                    DH
Pseudo-       No. of       Position    No. of Linked   Length        No. of       Position    No. of Linked   Length
chromosome    Binmarkers   (cM)        Scaffolds       (Mb)          Binmarkers   (cM)        Scaffolds       (Mb)
A01           196          162.14      29              33.7          133          128.911     27              33.3
A02           172          133.611     15              30.4          109          140.507     14              30.0
A03           228          178.902     5               36.4          111          170.588     7               39.3
A04           108          123.855     14              23.4          85           95.46       14              23.4
A05           214          117.505     30              36.0          150          153.926     25              35.5
A06           256          115.225     25              39.7          137          137.036     18              37.3
A07           149          123.353     10              29.7          117          148.978     11              34.8
A08           116          115.619     15              27.7          111          127.699     14              27.4
A09           264          136.684     31              54.4          143          188.759     29              53.9
A10           102          109.837     6               18.5          71           99.652      9               21.2
Total         1,805        1,316.731   146             329.9         1,167        1,391.516   138             336.1

Supplemental Table 5. List of mis-assembled scaffolds and their splitting information.

Chr ID   Sca ID           Split Sca ID       Start      End        Order   Supporting RILs Binmarkers
A05      Scaffold000001   Scaffold000001_1   1          1568174    +       12 BinMarkers
[...]
A09      Scaffold000001   Scaffold000001_3   2839152    6711068    -       8 BinMarkers
[...]
A06      Scaffold000003   Scaffold000003_4   10101790   11093742   +
A10      Scaffold000005   Scaffold000005_1   1499897    8038702    ?       52 BinMarkers
[...]
A01      Scaffold000064   Scaffold000064_1   1          413784     -
[...]
A09      Scaffold000103   Scaffold000103_2   1          469246     ?
[...]

Supplemental Table 6. Detailed information about the order and orientation of scaffolds on ten chromosomes of B. rapa.

Group ID   Chr ID   Split Sca Order      Sca Order        Start      End        Order   Split Sca   Z16XL144DH Linkage Map   RILs Linkage Map
G01        A01      Scaffold000019       Scaffold000019   1          4924075    -                   support
G01        A01      Scaffold000022_1     Scaffold000022   3811411    4302067    ?       Split       support                  3 BinMarkers
G01        A01      Scaffold001067       Scaffold001067   1          6655       ?                   no support
[...]
G01        A01      Scaffold000025       Scaffold000025   1          3737062    +                   support
G01        A01      Scaffold000108_1     Scaffold000108   1          452712     -       Split       support                  4 BinMarkers
[...]
G01        A01      Scaffold000007_1     Scaffold000007   1          2257556    +       Split       support                  5 BinMarkers
[...]
G01        A01      Scaffold000064_1     Scaffold000064   1          413784     -       Split       no support               4 BinMarkers
[...]
G04        A09      Scaffold000103_2     Scaffold000103   1          469246     ?       Split       no support
[...]
G05        A05      Scaffold000059_1     Scaffold000059   1          1942021    +                   support
[...]
G05        A05      Scaffold000042       Scaffold000042   1          2610373    +       Split       support                  7 BinMarkers
[...]
G05        A05      Scaffold000006_4     Scaffold000006   6679123    7844833    +       Split       no support               8 BinMarkers
[...]
G08        A06      Scaffold000003_4     Scaffold000003   10101790   11093742   +       Split       no support               4 BinMarkers
[...]
G08        A06      Scaffold000087_con   Scaffold000087   1          1117089    -                   support
[...]

Supplemental Table 7. Comparisons of statistics on genomic annotations between B. rapa genome V1.5 and V2.0.

Genome                                 assemVer 1.5   assemVer 2.0
Size (bp)                              283,810,373    389,189,875
GC Content                             35.26%         36.17%

Genes                                  annotVer 1.5   annotVer 2.0
Number of Genes                        41,174         48,826
Number of Genes on Plus Strand         20,608         24,684
Number of Genes on Minus Strand        20,566         24,142
Multi-exon Genes                       32,240         36,850
Mean Gene Length (bp)                  2,015          1,908
Gene density (Kb/gene)                 6.9            8.0
Number of Transcripts                  41,174         55,959
Percent of Transcripts with Introns    78.30%         78.04%
Mean Transcript Length (bp)            2,015          2,348
Mean CDS Length                        1,172          1,100
Percent Coding                         17.00%         13.80%

Exons                                  annotVer 1.5   annotVer 2.0
Number                                 206,990        237,462
Mean Number per Transcript             5.03           4.86
Mean Length (bp)                       233            226
Total Length (bp)                      48,237,786     53,705,886

Introns                                annotVer 1.5   annotVer 2.0
Number                                 165,816        188,636
Mean Number per Transcript             4.03           3.86
Mean Length (bp)                       209            209
Total Length (bp)                      34,732,551     39,443,842

UTRs                                   annotVer 1.5   annotVer 2.0
Number of Genes Having UTRs            NA             29,423
Mean UTR Length (bp)                   NA             230.23
Number of 5′ UTRs                      NA             40,854
Mean 5′ UTR Length (bp)                NA             178.95
Number of 3′ UTRs                      NA             35,836
Mean 3′ UTR Length (bp)                NA             288.69

Supplemental Table 8. Classification of tandem gene arrays of the two versions based on their syntenic relationship.

Item                        V1.5    V2.0
Tandem-to-Tandem (a)        1,458   1,517
Tandem-to-NonTandem (b)     453     1,636
Tandem-to-NonSynteny (c)    253     372
Total                       2,164   3,525

a: counterparts of tandem arrays in V1.5 (or V2.0) are also tandem arrays in V2.0 (or V1.5)
b: tandem arrays in V1.5 (or V2.0) correspond to non-tandem arrays in V2.0 (or V1.5)
c: tandem arrays in V1.5 (or V2.0) have no syntenic counterparts in V2.0 (or V1.5)

Supplemental Table 9. The enriched GO terms among tandem arrays in B. rapa genome V2.0. (excel table)

Supplemental Table 10. Comparisons of the two annotations based on their syntenic relationship.

Item                         V1.5          V2.0
Total genes                  41,174        48,826
Tandem (#arrays|#genes)      2,164|5,228   3,525|7,977
Tandem Redundancy Removed    38,110        44,374
Non-syntenic Genes           3,834         8,630
Syntenic Genes               34,276        35,744
One-to-one Genes             29,294        29,294
Multiple-to-one Genes        2,235         1,076
One-to-multiple Genes        2,327         4,972

Supplemental Table 11. Statistics of mRNA-Seq data.

Tissue               Total Raw Data (Gb)
Above-ground Stem    3.15
Flower               4.06
Small Anther         3.77
Middle Anther        3.32
Tender Leaf          4.7
Middle Leaf          3.32
Petiole              3.48
Seed Pod             3.26

Supplemental Table 12. Assessment of V1.5 and V2.0 gene sets in BUSCO notation.

Version   Gene number   BUSCO notation assessment results
V1.5      41,174        C:93.7% [D:67%], F:2.1%, M:4.2%, n:429*
V2.0      48,826        C:93% [D:67%], F:2.8%, M:4.2%, n:429

*: C: complete [D: duplicated], F: fragmented, M: missing, n: gene number.

Supplemental Table 13. Statistics of non-coding RNAs.

Type     Subtype     Copy    Average Length (bp)   Total Length (bp)   % Genome
miRNA                1,295   211                   273,763             0.070
tRNA                 1,391   75                    104,488             0.027
rRNA                 1,730   299                   517,909             0.133
         18S         677     499                   337,911             0.087
         28S         573     120                   68,828              0.018
         5.8S        244     335                   81,737              0.021
         5S          236     125                   29,433              0.007
snRNA                3,511   84                    295,668             0.076
         CD-box      3,201   79                    253,329             0.065
         HACA-box    130     120                   15,639              0.004
         splicing    180     148                   26,700              0.007

Supplemental Table 14. Detailed sub-genome information of B. rapa genome V2.0. (excel table)

Supplemental Table 15. Dominant gene expression between sub-genomes LF and MFs in B. rapa genome V2.0.

Organ   #2-fold changes*        Not expressed   Binomial test (LF & MFs)
        LF     MF1    MF2
leaf    300    178    154       87              2.61E-13
stem    280    183    154       71              6.29E-10
root    264    197    165       48              4.40E-06

*: number of genes expressed at least two-fold higher than both of the other two syntenic genes.

Supplemental Table 16. Sequence homology between B. rapa version-specific genes and other Brassicaceae species.

               V1.5                                   V2.0
Species        hits     identity>=70%, coverage>=70%  hits     identity>=70%, coverage>=70%
A. thaliana    2,278    972                           4,631    1,274
A. lyrata      2,300    945                           4,459    1,257
B. oleracea    3,194    1,693                         6,512    2,541
C. rubella     2,223    883                           4,602    1,226
T. parvula     2,293    976                           4,910    1,345
Total          3,244    1,758                         6,687    2,642

Supplemental Table 17. Enriched GO terms of over retained genes in B. rapa

Supplemental Table 18. Enriched GO terms of under retained genes in B. rapa

Supplemental Figures

Supplemental Figure 1. An example of integrating information from the linkage map and the synteny map between B. rapa and (A) A. thaliana or (B) S. parvula to determine the order of scaffolds in local regions of chromosomes (A04). Red plots represent the genetic map of the RILs population, and green plots represent the genetic map of the F1DH population.

Supplemental Figure 2. Scatterplot showing the correspondence between physical position and genetic distance of RILs (blue) and F1DH (red) populations in ten chromosomes of B. rapa.

Supplemental Figure 3. Distribution of total tandem genes and tandem arrays along chromosomes in B. rapa genome V1.5 and V2.0. A 6-Mb sliding window with a 1-Mb step was applied to screen tandem genes and tandem arrays across ten chromosomes. The y-axis denotes the number of tandem genes and tandem arrays in each window.
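The sliding-window counts plotted in this figure can be sketched as follows. The function name and the representation of each tandem gene by a single coordinate are hypothetical, and the handling of the chromosome end (only full-length windows are reported) is an assumption, since the caption does not say how the last windows were treated.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def sliding_window_counts(positions: Iterable[int], chrom_length: int,
                          window: int = 6_000_000,
                          step: int = 1_000_000) -> List[Tuple[int, int]]:
    """Count features in windows [start, start + window) moved by `step`.

    Returns a list of (window start, number of features in the window).
    """
    ordered = sorted(positions)
    last_start = max(chrom_length - window, 0)  # assumption: full windows only
    counts = []
    for start in range(0, last_start + 1, step):
        end = start + window
        n = bisect_left(ordered, end) - bisect_left(ordered, start)
        counts.append((start, n))
    return counts


# Toy chromosome of 8 Mb with four tandem genes.
toy_positions = [500_000, 1_500_000, 2_500_000, 7_900_000]
print(sliding_window_counts(toy_positions, 8_000_000))
```

The same function with `window=500_000` and `step=100_000` would correspond to the window settings quoted for the TE distribution in Supplemental Figure 6, although that figure reports the ratio of TE sequence per window rather than a count.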

Supplemental Figure 4. Comparisons of gene annotations in the two assemblies of the B. rapa genome. (A) Classification of corresponding gene pairs in the two gene annotations. (B) Illustration of the classification of gene pairs between the two assemblies, namely one-to-one gene pairs, one-to-multiple genes, and multiple-to-one genes. (C) Examples of sequence alignments of one-to-one gene pairs showing the variation between the two gene sets. The first shows only a point mutation between the two protein sequences; the second shows a gene annotated with more coding sequence in V2.0 than in V1.5; and the third shows a gene with more coding sequence in V1.5 than in V2.0. The fourth example shows a combination of the cases given above.

Supplemental Figure 5. Comparisons of TEs between B. rapa genome V1.5 and V2.0. (A) Frequency distributions of TE subgroups in the two assemblies of the B. rapa genome. (B) Number and frequency distributions of TEs with different lengths in the two assemblies. The x-axis denotes the lengths of TEs, and the y-axis denotes the TE counts (left) and percentage (right).

43

Supplemental Figure 6. Distribution of TEs along ten chromosomes in B. rapa

genome V1.5 and V2.0. A 500-kb sliding window with a 100-kb step was applied to

screen TEs across the ten chromosomes. The y-axis denotes the ratio of TE sequences

in each window.
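
One way to compute a per-window TE ratio of this kind is sketched below. It assumes that "ratio of TE sequences" means the fraction of bases in a window covered by at least one TE annotation, that intervals are 0-based and half-open, and that the last window is truncated at the chromosome end; the function names are hypothetical and this is not the authors' code.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def te_ratio_per_window(te_intervals, chrom_length, window=500_000, step=100_000):
    """Fraction of each window covered by TEs (500-kb window, 100-kb step)."""
    merged = merge_intervals(te_intervals)  # avoid double-counting nested TEs
    ratios = []
    start = 0
    while True:
        end = min(start + window, chrom_length)
        # Sum the overlap between the window and every merged TE interval.
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in merged)
        ratios.append((start, end, covered / (end - start)))
        if end >= chrom_length:
            break
        start += step
    return ratios

# Toy example: two overlapping TEs near the start and one further along.
tes = [(0, 100_000), (50_000, 150_000), (550_000, 600_000)]
profile = te_ratio_per_window(tes, chrom_length=700_000)
```

Merging the intervals first keeps overlapping annotations from inflating the ratio above 1.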

Supplemental Figure 7. Examples showing more TEs were annotated in

corresponding intergenic regions of the two genome assemblies of B. rapa. Blue

triangles represent TEs located in the intergenic regions. Red arrows link

corresponding gene pairs that show syntenic relationships. (A) A new TE in V2.0

shows limited similarity (coverage=71.9%) to the intergenic region between genes

Bra000493 and Bra000494. (B) Three more TEs were annotated between

BraA04001951 and BraA04001952 than between Bra034310 and Bra034311.

Supplemental Figure 8. The density of orthologous genes in three sub-genomes (LF,

MF1 and MF2) of B. rapa genome V2.0 compared to A. thaliana. The x-axis denotes

the seven chromosomes of the ancestral Brassica genome. The y-axis denotes the

percentage of retained orthologous genes in B. rapa sub-genomes around each A.

thaliana gene, with a total window size of 1,001 genes, 500 genes flanking each side

of a given gene. Part of the D block was lost in sub-genome LF, resulting in

decreased gene density in the head of tPCK7.
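
The gene-based window in this caption (1,001 genes: the focal gene plus 500 on each side) can be sketched as below. The input is assumed to be a list of booleans in A. thaliana gene order, marking whether each gene has a retained ortholog in a given B. rapa sub-genome. The function name and the truncation of windows at chromosome ends are assumptions, not details taken from the paper.

```python
def retention_percentage(retained, flank=500):
    """Percentage of retained orthologs in a window centred on each gene.

    retained : list of bool in A. thaliana gene order (True = ortholog retained)
    flank    : genes on each side of the focal gene (500 gives a 1,001-gene window)
    """
    n = len(retained)
    # Prefix sums let each window be evaluated in constant time.
    prefix = [0]
    for flag in retained:
        prefix.append(prefix[-1] + (1 if flag else 0))
    percentages = []
    for i in range(n):
        lo = max(0, i - flank)          # assumption: truncate at chromosome ends
        hi = min(n, i + flank + 1)
        percentages.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return percentages

# Toy example with a 3-gene window (flank=1) instead of 1,001 genes.
toy = retention_percentage([True, False, True, True], flank=1)
```

Running this once per sub-genome (LF, MF1, MF2) would give three density curves of the kind the figure compares.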

Supplemental Figure 9. RNA+ TEs show a negative association with expression levels

of gene doublets between sub-genomes LF and MF1, and between LF and MF2. (A) Dominantly

expressed genes from sub-genome LF show a low level of RNA+ TEs in the 2 kb of 5′

UTR regions. dLF denotes dominantly expressed genes from LF. (B) Dominantly

expressed genes from sub-genome MF1 show a low level of RNA+ TEs in the 2 kb of

5′ UTR regions. dMF1 denotes dominantly expressed genes from MF1. In total, 1,046 and 665

gene doublets of dLF-MF1 and LF-dMF1 were considered, respectively. (C)

Dominantly expressed genes from sub-genome LF show a low level of RNA+ TEs in

the 2 kb of 5′ UTR regions. (D) Dominantly expressed genes from sub-genome MF2

show a low level of RNA+ TEs in the 2 kb of 5′ UTR regions. dMF2 denotes

dominantly expressed genes from MF2. In total, 878 and 533 gene doublets of dLF-MF2 and

LF-dMF2 were considered, respectively.

Supplemental References

Birney, E., Clamp, M., and Durbin, R. (2004). GeneWise and genomewise. Genome research

14:988-995.

Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled

contigs using SSPACE. Bioinformatics 27:578-579.

Cheng, F., Sun, C., Wu, J., Schnable, J., Woodhouse, M.R., Liang, J., Cai, C., Freeling, M., and Wang, X.

(2016). Epigenetic regulation of subgenome dominance following whole genome triplication

in Brassica rapa. The New phytologist 211:288-299.

Cheng, F., Wu, J., Fang, L., and Wang, X. (2012). Syntenic gene analysis between Brassica rapa and

other Brassicaceae species. The Brassica Genome:5.

English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., and Worley,

K.C. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read

sequencing technology. PloS one 7:e47768.

Feng, C., Jian, W., Lu, F., Silong, S., Bo, L., Ke, L., Guusje, B., and Xiaowu, W. (2012). Biased Gene

Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa. PloS

one 7:e36442.

Freeling, M. (2009). Bias in plant gene content following different sorts of duplication: tandem,

whole-genome, segmental, or by transposition. Annual review of plant biology 60:433-453.

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L.,

Raychowdhury, R., and Zeng, Q. (2011). Trinity: reconstructing a full-length transcriptome

without a genome from RNA-Seq data. Nature biotechnology 29:644.

Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., and Bateman, A. (2005). Rfam:

annotating non-coding RNAs in complete genomes. Nucleic acids research 33:D121-D124.

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick, L.I., Maiti, R., Ronning,

C.M., Rusch, D.B., and Town, C.D. (2003). Improving the Arabidopsis genome annotation

using maximal transcript alignment assemblies. Nucleic acids research 31:5654-5666.

Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., and Wortman,

J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the

Program to Assemble Spliced Alignments. Genome biology 9:R7.

Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty,

L., and Duquenne, L. (2009). InterPro: the integrative protein signature database. Nucleic

acids research 37:D211-D215.

Larkin, M.A., Blackshields, G., Brown, N., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F.,

Wallace, I.M., Wilm, A., and Lopez, R. (2007). Clustal W and Clustal X version 2.0.

Bioinformatics 23:2947-2948.

Larose, D.T. (2005). k‐Nearest Neighbor Algorithm. Discovering Knowledge in Data: An Introduction

to Data Mining:90-106.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv

preprint arXiv:1303.3997.

Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., and Wang, J. (2009a). SNP detection for

massively parallel whole-genome resequencing. Genome research 19:1124-1132.

Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). SOAP2: an improved

ultrafast tool for short read alignment. Bioinformatics 25:1966-1967.

Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA

genes in genomic sequence. Nucleic acids research 25:955-964.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., and Liu, Y. (2012).

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

GigaScience 1:1-6.

Parra, G., Bradnam, K., and Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in

eukaryotic genomes. Bioinformatics 23:1061-1067.

Salmela, L., and Rivals, E. (2014). LoRDEC: accurate and efficient long read error correction.

Bioinformatics:btu538.

She, R., Chu, J.S.-C., Wang, K., Pei, J., and Chen, N. (2009). GenBlastA: enabling BLAST to identify

homologous gene sequences. Genome research 19:143-149.

Simao, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M. (2015). BUSCO:

assessing genome assembly and annotation completeness with single-copy orthologs.

Bioinformatics 31:3210-3212.

Sun, C., Wu, J., Liang, J., Schnable, J.C., Yang, W., Cheng, F., and Wang, X. (2015). Impacts of

Whole-Genome Triplication on MIRNA Evolution in Brassica rapa. Genome biology and

evolution 7:3085-3096.

Sun, X., Liu, D., Zhang, X., Li, W., Liu, H., Hong, W., Jiang, C., Guan, N., Ma, C., and Zeng, H. (2013).

SLAF-seq: an efficient method of large-scale de novo SNP discovery and genotyping using

high-throughput sequencing. PLoS One 8:e58700.

Tarailo‐Graovac, M., and Chen, N. (2009). Using RepeatMasker to identify repetitive elements in

genomic sequences. Current Protocols in Bioinformatics:4.10.1-4.10.14.

Van Ooijen, J. (2006). JoinMap® 4, Software for the calculation of genetic linkage maps in experimental

populations. Kyazma BV, Wageningen 33:10.1371.

Wang, F., Li, L., Li, H., Liu, L., Zhang, Y., Gao, J., and Wang, X. (2012). Transcriptome analysis of rosette

and folding leaves in Chinese cabbage using high-throughput RNA sequencing. Genomics

99:299-307.

Wang, X., and Cheng, F. (2016). Epigenetic regulation of subgenome dominance following whole

genome triplication in Brassica rapa. New Phytologist:1.

Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.-H., Bancroft, I., and Cheng, F.

(2011). The genome of the mesopolyploid crop species Brassica rapa. Nature genetics

43:1035-1039.

Xu, Z., and Wang, H. (2007). LTR_FINDER: an efficient tool for the prediction of full-length LTR

retrotransposons. Nucleic acids research 35:W265-W268.

Yu, X., Wang, H., Zhong, W., Bai, J., Liu, P., and He, Y. (2013). QTL mapping of leafy heads by genome

resequencing in the RIL population of Brassica rapa. PloS one 8:e76059.
