The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions

Nature Genetics: doi:10.1038/ng.2470

Shaogui Guo 1,2,17, Jianguo Zhang 3,4,17, Honghe Sun 1,2,5,17, Jerome Salse 6,17, William J. Lucas 7,17, Haiying Zhang 1, Yi Zheng 2, Linyong Mao 2, Yi Ren 1, Zhiwen Wang 3, Jiumeng Min 3, Xiaosen Guo 3, Florent Murat 6, Byung-Kook Ham 7, Zhaoliang Zhang 7, Shan Gao 2, Mingyun Huang 2, Yimin Xu 2, Silin Zhong 2, Aureliano Bombarely 2, Lukas A. Mueller 2, Hong Zhao 1, Hongju He 1, Yan Zhang 1, Zhonghua Zhang 8, Sanwen Huang 8, Tao Tan 9, Erli Pang 9, Kui Lin 9, Qun Hu 10, Hanhui Kuang 10, Peixiang Ni 3,4, Bo Wang 3, Jingan Liu 1, Qinghe Kou 1, Wenju Hou 1, Xiaohua Zou 1, Jiao Jiang 1, Guoyi Gong 1, Kathrin Klee 11, Heiko Schoof 11, Ying Huang 3, Xuesong Hu 3, Shanshan Dong 3, Dequan Liang 3, Juan Wang 3, Kui Wu 3, Yang Xia 1, Xiang Zhao 3, Zequn Zheng 3, Miao Xing 3, Xinming Liang 3, Bangqing Huang 3, Tian Lv 3, Junyi Wang 3, Ye Yin 3, Hongping Yi 12, Ruiqiang Li 13, Mingzhu Wu 12, Amnon Levi 14, Xingping Zhang 1, James J. Giovannoni 2,15, Jun Wang 3,16, Yunfu Li 1, Zhangjun Fei 2,15 & Yong Xu 1

1 National Engineering Research Center for Vegetables, Beijing Academy of Agriculture and Forestry Sciences, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (North China), Beijing, China.
2 Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, NY, USA.
3 BGI-Shenzhen, Chinese Ministry of Agriculture, Key Lab of Genomics, Shenzhen, China.
4 T-Life Research Center, Fudan University, Shanghai, China.
5 College of Plant Science and Technology, Beijing University of Agriculture, Beijing, China.
6 INRA, UMR 1095, Genetics, Diversity and Ecophysiology of Cereals, F-63100 Clermont-Ferrand, France.
7 Department of Plant Biology, College of Biological Sciences, University of California, Davis, CA, USA.
8 Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China.
9 College of Life Sciences, Beijing Normal University, Beijing, China.
10 College of Horticulture and Forestry, Huazhong Agriculture University, Wuhan, China.
11 INRES Crop Bioinformatics, University of Bonn, Katzenburgweg 2, 53115 Bonn, Germany.

12 Xinjiang Academy of Agricultural Sciences, Urumqi, China.
13 Beijing Novogene Bioinformation Technology Co. Ltd, Beijing, China.
14 USDA, ARS, U.S. Vegetable Lab, 2700 Savannah Highway, Charleston, SC, USA.
15 USDA Robert W. Holley Center for Agriculture and Health, Tower Road, Ithaca, NY, USA.
16 Department of Biology, University of Copenhagen, Copenhagen, Denmark.
17 These authors contributed equally to this work.

Correspondence should be addressed to Yong Xu ([email protected]), Zhangjun Fei

([email protected]), Yunfu Li ([email protected]) or Jun Wang ([email protected]).

Table of contents

1. Supplementary Note
  S1 Genome sequencing, assembly and quality assessment
    S1.1 Whole genome shotgun sequencing using the Illumina technology
    S1.2 De novo assembly of the watermelon genome
    S1.3 Evaluation of the effect of sequence depth and large-insert reads on the quality of genome assemblies
    S1.4 Unassembled genome evaluation
    S1.5 Sequencing of BAC ends and full BAC clones
    S1.6 Evaluation of the quality of the assembled watermelon genome
      S1.6.1 Gene coverage
      S1.6.2 Genome coverage
      S1.6.3 Structural correctness of watermelon genome assembly
  S2 Genome annotation
    S2.1 Repeat annotation
      S2.1.1 De novo identification of repeat sequences
      S2.1.2 Employment of Repbase for repeat identification
      S2.1.3 Classification of de novo TEs
    S2.2 Functional annotation of watermelon genes
    S2.3 Non-coding RNA (ncRNA) annotation
  S3 Watermelon chromosome evolution analysis
    S3.1 Dating of paralogous and orthologous gene pairs
  S4 Genome resequencing
    S4.1 Validation of SNPs and small indels
    S4.2 Distribution of SNPs and small indels across the watermelon genome
    S4.3 Phylogenetic relationship and population structure analyses
    S4.4 Selective sweep analysis
  S5 Disease resistance-related genes
    S5.1 Identification of disease resistance genes
    S5.2 Coverage of watermelon NBS-LRR genes by the genome assembly
    S5.3 Watermelon NBS-LRR genes in semi-wild and wild accessions
  S6 Comparative analysis of cucurbit phloem sap and vascular transcriptomes
    S6.1 Identification of phloem sap transcripts
    S6.2 Comparative analysis of phloem sap and vascular transcripts
  S7 Regulation of watermelon fruit development and quality
    S7.1 Model of sugar accumulation in watermelon fruit flesh
    S7.2 Identification and classification of transcription factors
    S7.3 Identification of sucrose-controlled upstream open reading frame (SC-uORF) containing bZIP transcription factors
    S7.4 MADS box genes in watermelon and cucumber genomes
2. Supplementary References

Supplementary Note

S1 Genome sequencing, assembly and quality assessment

S1.1 Whole genome shotgun sequencing using the Illumina technology

Illumina short-insert paired-end (clone size: 100, 200, and 400 bp) and large-insert

mate-pair (2, 5, 10 and 20 kb) libraries were prepared following the manufacturer’s

instructions. For construction of mate-pair libraries, DNA circularization, digestion of

linear DNA, fragmentation of circularized DNA, and purification of biotinylated DNA

were performed prior to adapter ligation. The template DNA fragments of the

constructed libraries were hybridized to the surface of flow cells, amplified to form

clusters and then sequenced on the Illumina GAII system, based on the standard

Illumina protocol.

S1.2 De novo assembly of the watermelon genome

To achieve a high-quality genome assembly, raw Illumina reads were processed to remove low-quality reads, adaptor sequences, and possible contaminating reads of bacterial and viral origin. The resulting high-quality cleaned reads from libraries with insert sizes ranging from 100 to 400 bp were assembled into contigs using SOAPdenovo, a de Bruijn graph-based assembler (ref. 1). Paired-end relationships from all PE library reads were then used to join contigs into scaffolds. Finally, we used all short-insert reads (100-400 bp) to fill gaps within scaffolds.

S1.3 Evaluation of the effect of sequence depth and large-insert reads on the

quality of genome assemblies

The reads from all short-insert paired-end libraries that we generated provided 83.7X coverage of the watermelon genome. To investigate the effect of sequence depth on the watermelon genome assembly, we randomly sampled data at different depths (20X, 40X, 50X, 60X, 65X, 70X, 75X and 80X) from each lane and assembled each data set independently. The total scaffold length approached saturation at 50X, growing only slowly beyond that depth;

however, N50 scaffold size showed an obvious positive correlation with the sequence

depth (Supplementary Fig. 4a and 4b). We then investigated the effect of

large-insert reads on the watermelon genome assembly. Six independent assemblies

were performed using combinations of cleaned reads from libraries of different insert

sizes: 100-200 bp, 400 bp, 2 kb, 5 kb, 10 kb and 20 kb. The first assembly used only

reads from the 100-200 bp insert libraries; whereas the subsequent assemblies were

performed using reads from the next longer insert libraries combined with reads used

in the previous assembly. Our results indicated that, by including reads from

large-insert libraries, both N50 size (from approximately 25 kb to 2.4 Mb) and total

length (from approximately 290 Mb to 350 Mb) of the watermelon genome assembly

were significantly increased (Supplementary Fig. 4c and 4d). In summary, our analysis revealed that both higher-depth sequencing of the watermelon genome and the inclusion of reads from large-insert mate-pair libraries substantially improved the assembly, indicating a positive impact on the cost-benefit ratio of high-quality genome sequence assembly.
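The N50 scaffold size used as the assembly metric above can be illustrated with a minimal sketch. It assumes the common definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together contain at least half of the total assembled length); the function name and the example lengths are hypothetical and are not taken from the watermelon assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half of the
        # total length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Hypothetical scaffold lengths, for illustration only.
example_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_scaffolds))  # 70
```

Because N50 depends on how contiguous the longest scaffolds are rather than on the total length, it can keep rising with depth or with larger insert sizes even after the total scaffold length has levelled off.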

S1.4 Unassembled genome evaluation

Unassembled reads were obtained after mapping the cleaned reads to the assembled

watermelon genome using the SOAPaligner with default parameters. Approximately

17.4% of the cleaned reads could not be aligned to the assembly and were therefore

regarded as “unassembled”; the percentage of unassembled reads was largely

consistent with the estimated unassembled portion of the genome (approximately

16.8%; 76.5 Mb out of a total of 425 Mb).

Three lanes of Illumina runs with 75 bp reads, which are suitable for BLAST

searches, were randomly selected from independent DNA libraries to investigate the

properties of unassembled reads. The unassembled reads from these lanes were

re-aligned to the genome assembly using BLASTN with less stringent criteria

(E-value of 1e-10, word size of 20 and low complexity filtering turned off) than

SOAPaligner. The alignments were further filtered and only those with lengths and

identities larger than 60 bp and 80%, respectively, were retained (Supplementary

Table 3). The distribution of “unassembled” reads mapped by BLAST onto individual

chromosomes is shown in Supplementary Fig. 2. This analysis clearly showed that

the majority of these unassembled reads mapped to the centromeric and

pericentromeric regions of the chromosomes. Additionally, in these “unassembled”

regions with high read depths, we were able to determine three repeat units, each of

which shared similarity to sequences related to centromeres, telomeres or 45S rDNA

clusters. Furthermore, FISH analyses confirmed the existence of these three types of

repeats in the watermelon genome (Fig. 1).
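The length and identity filter applied to the BLASTN re-alignments of unassembled reads can be sketched as follows. This is only an illustration: it assumes the alignments were written in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, ...), which the text does not state, and the function name and example lines are hypothetical.

```python
def keep_hit(line, min_length=60, min_identity=80.0):
    """Return True if a tabular BLAST hit passes the length and identity filter.

    Assumes tab-separated columns with percent identity in column 3 and
    alignment length in column 4 (standard tabular BLAST output).
    """
    fields = line.rstrip("\n").split("\t")
    identity = float(fields[2])
    length = int(fields[3])
    # The text keeps alignments "larger than" 60 bp and 80%, so both
    # comparisons are strict.
    return length > min_length and identity > min_identity


# Hypothetical hits, for illustration only.
hits = [
    "read1\tChr1\t95.00\t75\t3\t0\t1\t75\t1000\t1074\t1e-30\t120",
    "read2\tChr5\t78.50\t70\t15\t0\t1\t70\t2000\t2069\t1e-12\t60",
    "read3\tChr9\t99.00\t55\t1\t0\t1\t55\t3000\t3054\t1e-20\t95",
]
retained = [h.split("\t")[0] for h in hits if keep_hit(h)]
print(retained)  # ['read1']
```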

S1.5 Sequencing of BAC ends and full BAC clones

Both ends of 1,152 randomly selected clones from the BAC library of watermelon

inbred line 97103 (ref. 2) were sequenced with an Applied Biosystems 3730xl DNA

Analyzer. We obtained a total of 1,529 high quality sequences for 862 BAC clones,

among which 667 clones had high quality sequences from both ends and 195 clones

had sequences only from one end. The sequences are publicly available at the

Cucurbit Genomics Database (http://www.icugi.org).

Four BAC clones, two located in gene-rich euchromatin regions and the other two

located in highly repetitive centromeric regions, were fully sequenced with an Applied

Biosystems 3730xl DNA Analyzer. The sequences were deposited into GenBank

under accessions JN402338, JN402339, JX027061, and JX027062.

S1.6 Evaluation of the quality of the assembled watermelon genome

S1.6.1 Gene coverage

A total of 1,064,502 watermelon expressed sequence tags (ESTs), collected from various sources including the NCBI dbEST database (http://www.ncbi.nlm.nih.gov/dbEST/) and the Cucurbit Genomics Database (http://www.icugi.org), were used to assess the gene coverage of the watermelon genome assembly. ESTs were aligned to the genome assembly using BLAT (ref. 3). Only ESTs with alignments of identity ≥ 0.9 and coverage ≥ 0.5 were kept. The analysis

indicated that the genome assembly had a high coverage of gene coding regions

(~97%; Supplementary Table 4).

S1.6.2 Genome coverage

The four fully sequenced watermelon BACs were aligned to the genome assembly using the NCBI BLAST program with low-complexity filtering turned off (-F F). Only alignments with sequence identity ≥ 0.98 were kept. The genome assembly covered 97.8% and 97.6%, respectively, of the two BACs located in gene-rich euchromatin regions (GenBank accessions JN402338 and JN402339), whereas it covered only 90.2% and 64.2%, respectively, of the two BACs located in highly repetitive centromeric regions (GenBank accessions JX027061 and JX027062) (Supplementary Fig. 3). Low assembly coverage of highly repetitive BAC sequences is not uncommon, especially for genomes generated using next-generation sequencing (NGS) technologies. Nonetheless, the high coverage in gene-rich euchromatin regions, together with our finding (S1.4) that the ~17% of the genome that is unassembled consists mainly of repeat sequences, confirmed the high quality of the watermelon genome assembly.
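One plausible way to turn a set of retained alignments into a per-BAC coverage value is to merge overlapping alignment intervals on the BAC and divide the merged length by the BAC length. The sketch below illustrates that idea only; it is an assumption about the calculation, not a description of the procedure actually used, and the function name and coordinates are hypothetical.

```python
def covered_fraction(intervals, total_length):
    """Fraction of a sequence covered by the union of alignment intervals.

    `intervals` holds 1-based, inclusive (start, end) coordinates on the BAC.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap separates this interval from the current merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / total_length


# Hypothetical alignments on a 100-bp sequence, for illustration only.
print(covered_fraction([(1, 50), (40, 80), (91, 100)], 100))  # 0.9
```

Merging before summing prevents overlapping alignments, which are frequent in repetitive regions, from being counted twice.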

S1.6.3 Structural correctness of watermelon genome assembly

One common error in de novo genome assemblies is that two contigs are incorrectly

joined into one scaffold, resulting in local assembly errors. The alignments of the four

full BAC sequences did not identify any local assembly errors (Supplementary Fig.

3). To further study the structural correctness of the watermelon genome assembly, the

paired-end sequences of 667 BAC clones were aligned to the scaffolds using

BLASTN. In order to ensure unambiguous mapping, only sequences of at least 300 nt

that aligned to a unique location with a coverage of 95% or more and an identity of

99% or better were used. In total, 341 (51.1%) of the BAC end sequence pairs could

be aligned to a unique position on the scaffolds with these stringent criteria. Pairs of

end sequences that aligned to a single scaffold with incorrect orientation (i.e., both

end sequences aligned to the same strand) or too far from each other

(more than 200 kb) were considered indicators of potential assembly errors. Out of the

302 pairs aligned to the same scaffold, none were aligned inconsistently with the

genome assembly (Supplementary Table 5).
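The consistency criteria for BAC end sequence pairs described above (same scaffold, opposite strands, no more than 200 kb apart) can be expressed as a small classifier. This is a sketch under stated assumptions: the `EndHit` record, the category labels and the choice to measure distance as the outer span of the two alignments are all hypothetical, since the text does not specify how the distance was measured.

```python
from dataclasses import dataclass


@dataclass
class EndHit:
    """Unique alignment of one BAC end sequence (hypothetical record type)."""
    scaffold: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_pair(a, b, max_span=200_000):
    """Classify a BAC end pair against the assembly."""
    if a.scaffold != b.scaffold:
        return "different_scaffolds"
    if a.strand == b.strand:
        # Both ends on the same strand: incorrect orientation.
        return "same_strand"
    # Assumption: distance is the outer span covered by the two end alignments.
    span = max(a.end, b.end) - min(a.start, b.start) + 1
    if span > max_span:
        return "too_far"
    return "consistent"


# Hypothetical pair, for illustration only.
left = EndHit("scaffold1", 10_000, 10_600, "+")
right = EndHit("scaffold1", 120_000, 120_550, "-")
print(classify_pair(left, right))  # consistent
```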

Although our analysis of the relatively small set of full BAC and BAC end sequences identified no local assembly errors, our genome assembly would inevitably contain a small proportion of such errors, as is common in all genome assembly projects. Indeed, during scaffold anchoring based on the high-density genetic map, we found two scaffolds with a total size of 5.76 Mb (1.6% of the assembled genome) that were not consistent with the genetic map. However, at this stage we could not determine whether the inconsistency is due to genome assembly errors or to errors in the genetic map. Nonetheless, our analysis confirmed the overall structural correctness of the scaffolds of the watermelon genome assembly.

Furthermore, out of the 39 BAC end sequence pairs mapped to different scaffolds,

two were aligned to different chromosomes, indicating potential errors in scaffold

anchoring or errors caused by potential chimeric BAC clones (Supplementary Table

6).

In summary, our extensive analyses confirmed the high quality of the de novo

watermelon genome assembly.

S2 Genome annotation

S2.1 Repeat annotation

S2.1.1 De novo identification of repeat sequences

We first used PILER (ref. 4) and RepeatScout (ref. 5) for repeat sequence identification in the watermelon genome assembly. LTR retrotransposons were identified with LTR_FINDER (ref. 6) with default parameters. All repeat sequences with lengths > 100 bp and gap “N” less than 5% constituted the raw transposable element (TE) library. Second, we used all-versus-all BLASTN (E-value ≤ 1e-10) to search against the raw TE library, and sequences were filtered when two repeats aligned with identity ≥ 80%, coverage ≥ 80% and minimal matching length ≥ 100 bp;

this yielded a non-redundant TE library. Next, all non-redundant repeats were

searched against the SwissProt protein database to filter out protein-coding genes by

BLASTX (E-value ≤ 1e-4, identity ≥ 30%, coverage ≥ 30% and the minimal matching

length ≥30 aa). After manual correction, a de novo TE library for the watermelon

genome was obtained. RepeatClassifier was then used to classify repeat models for

the de novo TE library to establish a final classified de novo TE library. Finally, we

used RepeatProteinMask and RepeatMasker (http://www.repeatmasker.org) with the

final classified de novo TE library to search the assembled genome to locate TE loci.

S2.1.2 Employment of Repbase for repeat identification

We used RepeatProteinMask, RepeatMasker (http://www.repeatmasker.org) and the

known Repbase library (http://www.girinst.org/repbase/index.html) to find TE repeats

in the assembled genome. TEs were identified both at the DNA and protein level.

RepeatMasker was applied for DNA-level identification using a custom library (a

combination of Repbase, plant repeat database and our genome de novo TE library).

At the protein level, RepeatProteinMask was used to perform WU-BLASTX against

the TE protein database. Overlapping TEs belonging to the same type of repeats were

integrated, whereas those with low scores were removed if they overlapped > 80%

and belonged to different types.

S2.1.3 Classification of de novo TEs

A hierarchical system was used to classify de novo TEs. This system involved the

following steps: 1) BLASTN against Repbase; 2) BLASTX against TE proteins; 3)

BLASTN against plant/animal repeat databases; 4) BLASTX against SwissProt

proteins; 5) TBLASTX against Repbase; and 6) TBLASTX against plant/animal

repeat databases. For each step, TEs having significant hits with known repeats were

assigned a type either at the DNA level (E-value ≤ 1e-10, identity ≥ 80%, coverage

≥30% and the minimal matching length ≥ 80 bp) or at the protein level (E-value ≤1e-4,

identity ≥ 30%, coverage ≥ 30% and the minimal matching length ≥ 30 aa). LTR

retrotransposons identified by LTR_FINDER were classified as "unclassified LTR" if

they had no homology to known repeats.

In order to understand the evolution and bursts of LTR retrotransposons, we

aligned both ends of each pair of LTR retrotransposons. Then the divergence of the

LTR pairs was calculated using the distmat program implemented in the EMBOSS

package with the Kimura two-parameter model. Finally, the insertion time T was

calculated as T = K/r, with r as the rate of nucleotide substitution and K as the distance.

The molecular clock was set as 6.5 × 10⁻⁹ substitutions per site per year (ref. 7).
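The insertion-time calculation can be illustrated with a minimal sketch that computes the Kimura two-parameter distance between two aligned LTR sequences and converts it to a time with T = K/r, as stated above. The code is an independent illustration and does not reproduce the EMBOSS distmat implementation or its output format; the function names and example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites      # proportion of transitions
    q = transversions / sites    # proportion of transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time(k, rate=6.5e-9):
    """Insertion time in years, using T = K/r as given in the text."""
    return k / rate


# Hypothetical aligned LTR pair with one transition in ten sites.
k = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(round(k, 4), round(insertion_time(k) / 1e6, 1), "million years")
```

The two LTRs of a retrotransposon are identical at the time of insertion, so the divergence accumulated between them serves as the clock.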

S2.2 Functional annotation of watermelon genes

The predicted watermelon genes were compared to SwissProt, TrEMBL and Arabidopsis protein databases (ref. 8) using NCBI BLASTP (E-value ≤ 1e-4). Functional domains of watermelon genes were identified by comparing their sequences against protein databases including Pfam, PRINTS, PROSITE, ProDom, and SMART using InterProScan (ref. 9). Gene Ontology (GO) terms for each gene were obtained from the corresponding InterPro entries. Based on the results from BLASTP and InterProScan, and GO term information for each gene, functions of predicted watermelon genes were assigned using the AHRD pipeline (Automated assignment of Human Readable Descriptions) as described previously (ref. 10).

S2.3 Non-coding RNA (ncRNA) annotation

tRNA genes were identified by tRNAscan-SE (ref. 11) with default parameters. The C/D box snoRNAs were identified by Snoscan (ref. 12). Other ncRNAs, including miRNAs, snRNAs, and H/ACA box snoRNAs, were identified using INFERNAL software by searching against the Rfam (ref. 13) database with default parameters.

S3 Watermelon chromosome evolution analysis

S3.1 Dating of paralogous and orthologous gene pairs

We performed sequence divergence as well as speciation event dating analysis based

on the rate of nonsynonymous (Ka) vs. synonymous (Ks) substitutions, calculated with MEGA5 (ref. 14). The average substitution rate (r) of 6.5 × 10⁻⁹ substitutions per synonymous site per year was used to calibrate the age of the considered genes (ref. 7). The time (T) since gene insertion was then estimated using the formula T = Ks/r (refs. 15,16).
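As a worked illustration of the dating formula, consider a gene pair with a purely hypothetical synonymous divergence of Ks = 0.13 (this value is chosen for the example and is not a result of this study). With the substitution rate given above, the estimated age would be:

```latex
T = \frac{K_s}{r}
  = \frac{0.13}{6.5 \times 10^{-9}\ \text{yr}^{-1}}
  = 2.0 \times 10^{7}\ \text{years}
```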

S4 Genome resequencing

S4.1 Validation of SNPs and small indels

To confirm the SNP and small indel calling, a total of 75 non-overlapping genome

regions (~50 kb) were randomly selected in each of the 20 watermelon accessions,

PCR amplified, and sequenced with an Applied Biosystems 3730xl DNA Analyzer.

The resulting sequences were first processed to remove low quality regions (phred

quality score < 30) and then aligned to the reference 97103 genome by BWA (ref. 17).

S4.2 Distribution of SNPs and small indels across the watermelon genome

The majority of the 6.8 million SNPs (88.9%) we identified were located in intergenic regions, whereas only 2.9% were in coding regions. The ratio of nonsynonymous (97,933) to synonymous (94,225) substitutions was 1.04, which is higher than that of Arabidopsis (0.83; ref. 18) but lower than that of soybean (1.61; ref. 19) and rice (1.29; ref. 20). Of the 965,006 indels, 88.3% were located in intergenic regions, whereas only 0.57% (5,531) were located in coding regions.
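The two derived figures in this paragraph follow directly from the reported counts, as the short calculation below shows. Only the counts stated above are used; the variable names are chosen for the example.

```python
# Counts reported in the text.
nonsynonymous = 97_933
synonymous = 94_225
total_indels = 965_006
coding_indels = 5_531

ns_ratio = nonsynonymous / synonymous
coding_indel_pct = 100 * coding_indels / total_indels

print(f"{ns_ratio:.2f}")           # 1.04
print(f"{coding_indel_pct:.2f}%")  # 0.57%
```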

S4.3 Phylogenetic relationship and population structure analyses

The neighbor-joining tree contained four major groups, corresponding to the

cultivated C. lanatus subsp. vulgaris East-Asia ecotype and America ecotype, C.

lanatus subsp. mucosospermus and C. lanatus subsp. lanatus (Fig. 3a). PCA indicated

that the C. lanatus subsp. lanatus group was clearly separated from other groups using

the first and second eigenvectors (Fig. 3b). The wide dispersal of the C. lanatus subsp.

lanatus group indicated its higher diversity. In our samples, C. lanatus subsp. vulgaris

was very closely related to C. lanatus subsp. mucosospermus. The two cultivated

subgroups, C. lanatus subsp. vulgaris East-Asia ecotype and America ecotype, are

nearly indistinguishable, supporting the low level of genetic diversity associated with

cultivated watermelon.

Additional analysis of population structure was performed using the FRAPPE program (ref. 21). Here, we analyzed the data by increasing K (the number of populations)

from 2 to 5 (Fig. 3c). For K = 2, we identified a division between C. lanatus subsp.

lanatus and the other 17 accessions. Based on K = 3, the 21 watermelon accessions

were clearly divided into three groups, including C. lanatus subsp. lanatus, C. lanatus

subsp. vulgaris and C. lanatus subsp. mucosospermus. Using K = 4, C. lanatus subsp.

vulgaris was divided into two subgroups, East-Asia ecotype and America ecotype,

and for K = 5, a new subgroup within the C. lanatus subsp. mucosospermus group

emerged.

S4.4 Selective sweep analysis

In addition to potential selective sweeps, we also identified regions with the lowest (top 1%) π_mucosospermus/π_vulgaris ratios. In contrast to selective sweeps, these regions represent those

with significantly higher levels of polymorphisms in cultivated watermelon C. lanatus

subsp. vulgaris compared with C. lanatus subsp. mucosospermus, thus they can serve

as a negative control of selective sweeps. A total of 95 regions of 7.19 Mb in size,

containing 477 genes, were identified (Supplementary Table 17). GO term analysis

indicated that only a few biological processes were enriched in those 477 genes

when compared to the whole genome and none of them were associated with known

selected traits (Supplementary Table 18). As expected, this is in contrast to genes in

potential selective sweeps that were highly enriched with biological processes related

to important selected traits (Supplementary Table 16).
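Selecting the windows with the lowest diversity ratio can be sketched as follows. The example assumes that nucleotide diversity (π) has already been computed per genomic window for each group; the dictionaries, window labels and function name are hypothetical, and neither the window size nor the method of computing π is specified here.

```python
import math


def lowest_ratio_windows(pi_muco, pi_vulg, fraction=0.01):
    """Return the windows with the lowest pi_muco / pi_vulg ratio.

    Both arguments map a window identifier to its nucleotide diversity.
    Windows with zero or missing diversity in the denominator are skipped.
    """
    ratios = {}
    for window, value in pi_muco.items():
        denominator = pi_vulg.get(window, 0.0)
        if denominator > 0:
            ratios[window] = value / denominator
    n_keep = max(1, math.ceil(len(ratios) * fraction))
    ranked = sorted(ratios, key=ratios.get)  # ascending ratio
    return ranked[:n_keep]


# Hypothetical per-window diversity values, for illustration only.
pi_muco = {"w1": 0.0010, "w2": 0.0040, "w3": 0.0002, "w4": 0.0030}
pi_vulg = {"w1": 0.0010, "w2": 0.0010, "w3": 0.0020, "w4": 0.0010}
print(lowest_ratio_windows(pi_muco, pi_vulg, fraction=0.5))  # ['w3', 'w1']
```

A low ratio marks a window where the cultivated group is more polymorphic than the semi-wild group, which is the opposite of the pattern expected under a selective sweep.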

S5 Disease resistance-related genes

S5.1 Identification of disease resistance genes

Genes encoding nucleotide-binding site (NBS) and leucine-rich repeat (LRR)

domains were identified following a two-step process. First, watermelon protein

sequences were screened against the Pfam database for the presence of the NBS

(NB-ARC) family domain (PF00931) (ref. 22) using the HMMER3 program (http://hmmer.janelia.org). Second, conserved motifs were then derived from the domain profiles retrieved from the Pfam and SMART (http://smart.embl-heidelberg.de) databases and from the COILS Server (ref. 23), with a probability ≥ 90% to detect CC domains specifically.

Lipoxygenase (LOX) proteins were identified by comparing watermelon protein

sequences to the InterPro database to search for the lipoxygenase domain (IPR001024

or IPR000907).

To identify LRR-containing receptor-like proteins (RLPs) and receptor-like

kinases (RLKs), the TMHMM Server (http://www.cbs.dtu.dk/services/TMHMM) was

used to search for trans-membrane domains in watermelon protein sequences. Then,

the sequences were compared to the Pfam domain database to search for the LRR

domain (PF00560) and protein kinase domain (PF00069).
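The way the three searches could be combined into RLP and RLK calls can be sketched as below. The decision rule is an assumption based on the usual definitions (a transmembrane LRR protein with a kinase domain is an RLK, without one an RLP); the text lists the searches but does not spell out the rule, and the function name and inputs are hypothetical.

```python
LRR_DOMAIN = "PF00560"
KINASE_DOMAIN = "PF00069"


def classify_lrr_protein(has_tm_domain, pfam_domains):
    """Classify a protein as 'RLK', 'RLP' or None.

    has_tm_domain: whether a trans-membrane domain was predicted.
    pfam_domains: set of Pfam accessions detected in the protein.
    """
    has_lrr = LRR_DOMAIN in pfam_domains
    has_kinase = KINASE_DOMAIN in pfam_domains
    if not (has_tm_domain and has_lrr):
        return None  # not a candidate LRR-containing receptor
    # Assumed rule: the kinase domain distinguishes RLKs from RLPs.
    return "RLK" if has_kinase else "RLP"


# Hypothetical domain annotations, for illustration only.
print(classify_lrr_protein(True, {"PF00560", "PF00069"}))  # RLK
print(classify_lrr_protein(True, {"PF00560"}))             # RLP
```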

S5.2 Coverage of watermelon NBS-LRR genes by the genome assembly

The watermelon genome assembly contains considerably fewer NBS-LRR genes (44)

than other plant species such as rice, apple and maize. In this study we sought to

check whether this low number of identified NBS-LRR genes is due to the incomplete

coverage by the genome assembly of genes from this family. We first used BLAST to compare the 44 NBS-LRR protein sequences against a watermelon EST dataset with a low-stringency cutoff (E-value < 1e-5) to identify potential NBS-LRR genes in this EST collection, which contains ~75K unigenes assembled from ~600K EST sequences, with the majority being generated using the 454 sequencing technology (http://www.icugi.org; ref. 24). From this analysis, we obtained a total of 27 unigenes, among which 23 were

covered by the NBS-LRR genes. Alignments of these 23 unigenes to the NBS-LRR

genes indicated that they all have at least 99% identity if we removed homopolymer

errors in the unigene sequences. Detailed examination of the remaining four unigenes

indicated that they are all covered by the watermelon genome assembly, with three

(WMU79003, WMU36300 and WMU43890) covered by predicted genes (Cla019850,

Cla019073 and Cla015037, respectively). These three genes (two Leucine Rich

Repeat family proteins and one LRR receptor-like serine/threonine-protein kinase) all

lack the typical NB-ARC domain found in R proteins. Examination of the

corresponding genomic region of the unigene (WMU10848) that does not correspond

to any predicted genes indicated that the genomic region contains no open reading

frames (ORFs), suggesting it is probably a pseudogene. The fact that all 23 NBS-LRR
genes identified in the EST collection are covered by the genome assembly indicates
that NBS-LRR genes are very unlikely to be missing from the genome
assembly.
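The E-value filtering step described above can be sketched as follows. This is an illustration under stated assumptions: it presumes the BLAST results are available in the standard 12-column tabular layout (query, subject, ..., E-value in column 11, bit score in column 12), which the text does not specify, and the identifiers in the example rows are hypothetical.

```python
def unigenes_with_hits(blast_lines, max_evalue=1e-5):
    """Collect subject (unigene) IDs that have at least one hit below max_evalue.

    Assumes tab-separated rows with the subject ID in column 2 and the
    E-value in column 11.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue < max_evalue:
            hits.add(subject_id)
    return hits


# Hypothetical rows: query = NBS-LRR protein, subject = EST unigene.
example = [
    "geneA\tunigene1\t99.1\t210\t2\t0\t1\t210\t5\t634\t1e-120\t430",
    "geneA\tunigene2\t45.0\t150\t80\t3\t10\t160\t20\t470\t2e-07\t55",
    "geneB\tunigene3\t30.2\t90\t60\t2\t5\t95\t1\t270\t0.003\t31",
]
candidates = unigenes_with_hits(example)
print(sorted(candidates))
```

Each candidate unigene would then be aligned back to the predicted genes and to the assembly, as described in the text.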

S5.3 Watermelon NBS-LRR genes in semi-wild and wild accessions

We checked for the presence of the 44 NBS-LRR genes in the genomes of the
semi-wild/wild accessions by aligning the genome resequencing reads to these 44
NBS-LRR genes. All 44 NBS-LRR genes are present in the semi-wild
C. lanatus subsp. mucosospermus accessions. Only one gene, Cla012424, is absent
from PI296341-FR, PI482276 and PI482303, all of which belong to the wild C. lanatus subsp.
lanatus, although it is present in PI482326, another C. lanatus subsp. lanatus accession.

S6 Comparative analysis of cucurbit phloem sap and vascular transcriptomes

S6.1 Identification of phloem sap transcripts

The watermelon and cucumber vascular bundle transcriptomes represent those

mRNAs being expressed in the cambium, the companion cells as well as in the

phloem and xylem parenchyma. The phloem transcriptomes represent mRNAs that

are contained in the phloem sap collected from these plants. This sap represents the

phloem translocation stream that is carried by the enucleate sieve tube system. Thus,

the phloem sap transcriptome represents a unique population of transcripts present

within mature enucleate sieve elements that are generated by their nucleate
companion cells. In this study, the phloem transcriptomes contain only those
transcripts that were enriched at least two-fold in the
phloem sap relative to the vascular tissues. Excluding transcripts below
this two-fold enrichment removes those that could potentially have entered
the phloem sap as contamination from the surrounding tissues. We detected ~1,000
transcripts in the phloem sap of cucumber and watermelon. This mRNA population is
about 10 times smaller than that of the complex vascular tissues, as is to be
expected given that the mature sieve elements form a highly specialized conduit for nutrient
delivery.
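The two-fold enrichment filter can be sketched as below. This is a minimal illustration, not the pipeline used in the study: the expression units, the function name and the handling of transcripts with no detectable vascular expression (kept here as enriched) are assumptions.

```python
def phloem_enriched(sap_expr, vascular_expr, min_fold=2.0):
    """Return IDs of transcripts at least min_fold higher in sap than in vascular tissue.

    Assumption: a transcript detected in sap but absent from the vascular
    data is treated as enriched; the text does not cover this case.
    """
    enriched = []
    for transcript_id, sap_level in sap_expr.items():
        if sap_level <= 0:
            continue  # not detected in the phloem sap
        vascular_level = vascular_expr.get(transcript_id, 0.0)
        if vascular_level == 0 or sap_level / vascular_level >= min_fold:
            enriched.append(transcript_id)
    return sorted(enriched)


# Hypothetical normalized expression values.
sap = {"t1": 10.0, "t2": 3.0, "t3": 5.0, "t4": 0.0}
vascular = {"t1": 4.0, "t2": 2.0, "t4": 1.0}
enriched = phloem_enriched(sap, vascular)
print(enriched)
```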

S6.2 Comparative analysis of phloem sap and vascular transcripts

We compared the vascular transcripts and the phloem sap transcripts of watermelon with those of cucumber
using BLASTP with an E-value cutoff of 1e-5. We also compared the
whole gene sets of watermelon and cucumber (Supplementary Table 27). At
all E-value cutoffs tested, vascular transcripts had significantly more matching pairs
between watermelon and cucumber than the whole
gene sets of these two cucurbits. The converse held for phloem sap
transcripts, which had significantly fewer pairs than
the whole gene sets (P values of all chi-square tests were <0.0001). This analysis
indicated that transcripts in vascular bundles were highly conserved between
watermelon and cucumber, whereas those in the phloem translocation stream
(sampled by the sap) were highly divergent.
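A chi-square comparison of this kind can be sketched with a 2 x 2 table. The layout of the table (transcripts with and without a cross-species match, for a tissue-specific set versus the whole gene set), the absence of a continuity correction and the counts themselves are assumptions made for illustration; the actual counts are given in Supplementary Table 27.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the upper-tail
    probability equals erfc(sqrt(statistic / 2)). All four marginal totals
    must be non-zero.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows = vascular set, whole gene set;
# columns = with a matching pair, without a matching pair.
statistic, p_value = chi_square_2x2(900, 100, 800, 200)
print(f"chi-square = {statistic:.2f}, P = {p_value:.2e}")
```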

S7 Regulation of watermelon fruit development and quality

S7.1 Model of sugar accumulation in watermelon fruit flesh

Through strand-specific RNA-seq analysis, we identified a total of 13 sugar metabolic

enzyme coding genes that were differentially expressed during flesh development and

also between flesh and rind tissues (at least two-fold differential expression and

FDR<0.01). These 13 genes were distributed in seven enzyme categories
(Supplementary Table 39). AGA (α-galactosidase) and IAI (insoluble acid invertase)
are known to determine plant sink strength by regulating photosynthate unloading into
fruits25-27. The up-regulation of an AGA gene, Cla006123, and an IAI gene, Cla020872,
during watermelon flesh development, and their significantly higher expression levels
in high-sugar fruit flesh than in low-sugar rind, indicate important roles in
regulating the unloading and utilization of the translocated RFOs (raffinose family
oligosaccharides). The hydrolysis products of RFOs, sucrose and galactose, are metabolized
and utilized for further energy metabolism in the cytoplasm via a complex enzyme-mediated
network. A vacuolar SAI (soluble acid invertase) gene, Cla002328, was
down-regulated as sugar accumulated in the fruit flesh.
A significant negative correlation has been observed between SAI activity and
sucrose accumulation in melon28, tomato29 and sugarcane30. The decreased expression
of Cla002328 would reduce the rate of sucrose catabolism and lead to the accumulation
of sucrose at high concentration in the vacuole, the main sugar-storing organelle in
watermelon fruit flesh. A UGGP (UDP-galactose/glucose pyrophosphorylase) gene,
Cla013902, was up-regulated during watermelon flesh development, suggesting a key
role in fruit sink metabolism, based on the function of UGGP in catalyzing the
synthesis of UDP-glucose/UDP-galactose reported in melon31. Finally, the
differentially expressed NI (neutral invertase) gene, Cla021809, SPS (sucrose
phosphate synthase) gene, Cla011923, and UGE (UDP-glucose 4-epimerase) genes,
Cla009857 and Cla012809, may contribute to fruit sucrose catabolism and
utilization32, maintenance of the sucrose metabolism cycle33 and provision of substrates for
cell wall biosynthesis and growth34 during watermelon fruit flesh development.

Sugar transporters are necessary for transmembrane sugar transport and
partitioning35. A total of 14 sugar transporter genes were differentially
expressed as sugar accumulated in flesh tissue and between
high-sugar fruit flesh and low-sugar rind (Supplementary Table 40). We propose that
they play important roles in sugar accumulation in watermelon fruit flesh, as
in the fruit of tomato36 and grapevine37.

In summary, our model provides genomic insight into the complex
gene network involved in sugar unloading, metabolism and partitioning during sugar
accumulation in watermelon fruit flesh.

S7.2 Identification and classification of transcription factors

Transcription factors (TFs) were identified and classified into different families using

the iTAK program (http://bioinfo.bti.cornell.edu/tool/itak). The program first

compared watermelon protein sequences against the Pfam domain database22 using
the HMMER3 program (http://hmmer.janelia.org). Proteins containing the corresponding
DNA-binding domains were identified as TFs and further classified into different
families, based on the rules described in Perez-Rodriguez et al.38. The same pipeline was

also applied to other plant genomes, including cucumber, Arabidopsis, rice, grape,

poplar, papaya, sorghum, soybean, Brachypodium, maize, apple, cacao, strawberry,

and castor bean. TFs identified from these plant genomes are available at

http://bioinfo.bti.cornell.edu/cgi-bin/itak/db_home.cgi.

Within the watermelon genome, we identified a total of 1,448 putative

transcription factor (TF) genes, distributed in 59 families. The number of identified

TFs is among the lowest in the sequenced plant genomes, though comparable to 1,412

and 1,407 for cucumber and grape, respectively (Supplementary Table 41).

S7.3 Identification of sucrose-controlled upstream open reading frame (SC-uORF) containing bZIP transcription factors

To identify SC-uORF containing bZIP transcription factors in the watermelon genome,

we first extracted 2 kb sequences that are upstream of translation start sites (ATG) of

each of the 23,440 predicted protein coding genes. These 2 kb upstream sequences

were then translated into protein sequences using the transeq program in the

EMBOSS package (http://emboss.sourceforge.net), with the standard codon usage table
and in three forward frames. The resultant peptide sequences were scanned for the
presence of the conserved SC-uORF motif
([I/L/F][L/M/V/S][H/Q/L][S][F][S][V][V][F/Y][L][Y][W/Y][F/T/L][Y][N/V][I/F/V][S])
as reported in previous studies39,40. Peptide
sequences containing the SC-uORF motif were then manually checked to confirm
that they have both start and stop codons flanking the matched sites. In total,
four SC-uORF-containing bZIP transcription factors, Cla014247, Cla022469,
Cla014572 and Cla017361, were identified in the watermelon genome.
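The translation and motif scan can be sketched in a few lines. This is an illustration only: it replaces the EMBOSS transeq step with a small built-in translator using the standard codon table, writes the motif as a regular expression, and omits the manual check for flanking start and stop codons. The function names and the example sequence are hypothetical.

```python
import re

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# The 17-residue SC-uORF motif from the text, written as a regular expression.
SC_UORF = re.compile(r"[ILF][LMVS][HQL]SFSVV[FY]LY[WY][FTL]Y[NV][IFV]S")


def translate(dna, frame):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def find_sc_uorf(upstream_dna):
    """Return (frame number 1-3, peptide offset, matched peptide) for each motif hit."""
    hits = []
    for frame in range(3):
        peptide = translate(upstream_dna, frame)
        for match in SC_UORF.finditer(peptide):
            hits.append((frame + 1, match.start(), match.group()))
    return hits


# Hypothetical upstream fragment: two spacer bases, then ATG, the motif and a stop codon.
codons = ["ATG", "ATT", "CTT", "CAT", "TCT", "TTT", "TCT", "GTT", "GTT", "TTT",
          "CTT", "TAT", "TGG", "TTT", "TAT", "AAT", "ATT", "TCT", "TAA"]
example_dna = "GG" + "".join(codons)
hits = find_sc_uorf(example_dna)
print(hits)
```

For the example fragment the motif is found in the third forward frame, one residue after the initiating methionine.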

S7.4 MADS box genes in watermelon and cucumber genomes

One notable feature of both the watermelon and cucumber genomes is that they contain
far fewer MADS-box transcription factors than most other sequenced plant
genomes (Supplementary Table 41). Protein sequences of watermelon, cucumber
and Arabidopsis MADS-box genes, as well as tomato LeMADS-RIN and TAGL1, were
aligned using ClustalW (http://www.clustal.org). A neighbor-joining phylogenetic
tree of MADS-box proteins was then constructed from the alignment with 1,000
bootstrap replicates. The phylogenetic analysis identified two MADS-box family clades that appear
to be completely lost in both the watermelon and cucumber genomes (Supplementary
Fig. 15). The first includes FLC (FLOWERING LOCUS C) and five related MAF
(MADS AFFECTING FLOWERING) genes, which are negative regulators of floral
development41-43. The absence of these genes from the watermelon and cucumber genomes
implies that these two species may have different pathways for regulating floral
development, possibly related to the monoecious nature of their flowers. The second
clade that is absent from these two genomes contains a large group of 18 Type I
Arabidopsis MADS-box TFs, whose functions remain unclear, although they are
reported to evolve and be lost more quickly during evolution41.

Supplementary References

1. Li, R. et al. De novo assembly of human genomes with massively parallel short read

sequencing. Genome Res. 20, 265–272 (2010).

2. Joobeur, T. et al. Construction of a watermelon BAC library and identification of SSRs

anchored to melon or Arabidopsis genomes. Theor. Appl. Genet. 112, 1553–1562 (2006).

3. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

4. Edgar, R.C. & Myers, E.W. PILER: identification and classification of genomic repeats.

Bioinformatics 21, 152–158 (2005).

5. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large

genomes. Bioinformatics 21, 351–358 (2005).

6. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR

retrotransposons. Nucleic Acids Res. 35, 265–268 (2007).

7. Hu, T.T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size

change. Nat. Genet. 43, 476–481 (2011).

8. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M. & Bairoch, A. UniProtKB/Swiss-Prot.

Methods Mol. Biol. 406, 89–112 (2007).

9. Mulder, N. & Apweiler, R. InterPro and InterProScan: tools for protein sequence classification

and comparison. Methods Mol. Biol. 396, 59–70 (2007).

10. The Tomato Genome Sequencing Consortium. The tomato genome sequence provides insights

into fleshy fruit evolution. Nature 485, 635–641 (2012).

11. Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer RNA

genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

12. Lowe, T.M. & Eddy, S.R. A computational screen for methylation guide snoRNAs in yeast.

Science 283, 1168–1171 (1999).

13. Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic

Acids Res. 33, 121–124 (2005).

14. Tamura, K. et al. MEGA5: Molecular Evolutionary Genetics Analysis using Maximum

Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol. Biol. Evol. 28,

2731–2739 (2011).

15. Murat, F. et al. Ancestral grass karyotype reconstruction unravels new mechanisms of genome

shuffling as a source of plant evolution. Genome Res. 20, 1545–1557 (2010).

16. Salse, J. et al. Reconstruction of monocotelydoneous proto-chromosomes reveals faster

evolution in plants than in animals. Proc. Natl. Acad. Sci. USA 106, 14908–14913 (2009).

17. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform.

Bioinformatics 25, 1754–1760 (2009).

18. Clark, R.M. et al. Common sequence polymorphisms shaping genetic diversity in Arabidopsis

thaliana. Science 317, 338–342 (2007).

19. Lam, H.M. et al. Resequencing of 31 wild and cultivated soybean genomes identifies patterns

of genetic diversity and selection. Nat. Genet. 42, 1053–1059 (2010).

20. Xu, X. et al. Resequencing 50 accessions of cultivated and wild rice yields markers for

identifying agronomically important genes. Nat. Biotech. 30, 105–111 (2012).

21. Tang, H., Peng, J., Wang, P. & Risch, N.J. Estimation of individual admixture: analytical and

study design considerations. Genet. Epidemiol. 28, 289–301 (2005).

22. Finn, R.D. et al. The Pfam protein families database. Nucleic Acids Res. 38, 211–222 (2010).

23. Lupas, A., Van Dyke, M. & Stock, J. Predicting coiled coils from protein sequences. Science

252, 1162–1164 (1991).

24. Guo, S. et al. Characterization of transcriptome dynamics during watermelon fruit

development: sequencing, assembly, annotation and gene expression profiles. BMC Genomics

12, 454 (2011).

25. Godt, D.E. & Roitsch, T. Regulation and tissue-specific distribution of mRNAs for three

extracellular invertase isoenzymes of tomato suggests an important function in establishing

and maintaining sink metabolism. Plant Physiol. 115, 273–282 (1997).

26. Gao, Z. & Schaffer, A.A. A novel alkaline alpha-galactosidase from melon fruit with a

substrate preference for raffinose. Plant Physiol. 119, 979–988 (1999).

27. Carmi, N. et al. Cloning and functional expression of alkaline alpha-galactosidase from melon

fruit: similarity to plant SIP proteins uncovers a novel family of plant glycosyl hydrolases.

Plant J. 33, 97–106 (2003).

28. Schaffer, A.A., Aloni, B. & Fogelman, E. Sucrose metabolism and accumulation in

developing fruit of Cucumis. Phytochemistry 26, 1883–1887 (1987).

29. Yelle, S., Chetelat, R.T., Dorais, M., Deverna, J.W. & Bennett, A.B. Sink metabolism in

tomato fruit: IV. Genetic and biochemical analysis of sucrose accumulation. Plant Physiol. 95,

1026–1035 (1991).

30. Zhu, Y.J., Komor, E. & Moore, P.H. Sucrose accumulation in the sugarcane stem is regulated

by the difference between the activities of soluble acid invertase and sucrose phosphate

synthase. Plant Physiol. 115, 609–616 (1997).

31. Dai, N. et al. Cloning and expression analysis of a UDP-galactose/glucose pyrophosphorylase

from melon fruit provides evidence for the major metabolic pathway of galactose metabolism

in raffinose oligosaccharide metabolizing plants. Plant Physiol. 142, 294–304 (2006).

32. Roitsch, T. & González, M.-C. Function and regulation of plant invertases: sweet sensations.

Trends Plant Sci. 9, 606–613 (2004).

33. Nguyen-Quoc, B. & Foyer, C.H. A role for 'futile cycles' involving invertase and sucrose

synthase in sucrose metabolism of tomato fruit. J. Exp. Bot. 52, 881–889 (2001).

34. Rosti, J. et al. UDP-glucose 4-epimerase isoforms UGE2 and UGE4 cooperate in providing

UDP-galactose for cell wall biosynthesis and growth of Arabidopsis thaliana. Plant Cell 19,

1565–1579 (2007).

35. Slewinski, T.L. Diverse functional roles of monosaccharide transporters and their homologs

in vascular plants: a physiological perspective. Mol. Plant 4, 641–662 (2011).

36. Milner, I.D., Ho, L.C. & Hall, J.L. Properties of proton and sugar transport at the tonoplast of

tomato (Lycopersicon esculentum) fruit. Physiol. Plant 94, 399–410 (1995).

37. Afoufa-Bastien, D. et al. The Vitis vinifera sugar transporter gene family: phylogenetic

overview and macroarray expression profiling. BMC Plant Biology 10, 245 (2010).

38. Perez-Rodriguez, P. et al. PlnTFDB: updated content and new features of the plant

transcription factor database. Nucleic Acids Res. 38, 822–827 (2010).

39. Wiese, A., Elzinga, N., Wobbes, B. & Smeekens, S. A conserved upstream open reading frame

mediates sucrose-induced repression of translation. Plant Cell 16, 1717–1729 (2004).

40. Thalor, S.K. et al. Deregulation of sucrose-controlled translation of a bZIP-type transcription

factor results in sucrose accumulation in leaves. PLoS ONE 7, e33111 (2012).

41. Michaels, S.D. & Amasino, R.M. FLOWERING LOCUS C encodes a novel MADS domain

protein that acts as a repressor of flowering. Plant Cell 11, 949–956 (1999).

42. Ratcliffe, O.J., Nadzan, G.C., Reuber, T.L. & Riechmann, J.L. Regulation of flowering in

Arabidopsis by an FLC homologue. Plant Physiol. 126, 122–132 (2001).

43. Ratcliffe, O.J., Kumimoto, R.W., Wong, B.J. & Riechmann, J.L. Analysis of the Arabidopsis

MADS AFFECTING FLOWERING gene family: MAF2 prevents vernalization by short

periods of cold. Plant Cell 15, 1159–1169 (2003).

Supplementary Figures

Supplementary Figure 1. 17-mer depth distribution of the Illumina GA reads. Reads from

libraries with clone insert sizes of 200 bp were used for analysis. A total of 4,639,223,061

17-mers were obtained, and the peak depth was 11. Watermelon genome size was estimated
based on the formula: Genome size = (Total number of k-mers)/(Position of peak depth) =
4,639,223,061 / 11 = 421.75 Mb.
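The arithmetic in this caption can be reproduced directly. The short sketch below only restates the formula given above; the function name is hypothetical.

```python
def estimate_genome_size(total_kmers, peak_depth):
    """Genome size in bp = total number of k-mers / k-mer depth at the main peak."""
    return total_kmers / peak_depth


# Values reported in the caption (17-mers from the 200-bp insert libraries).
size_bp = estimate_genome_size(4_639_223_061, 11)
print(f"{size_bp / 1e6:.2f} Mb")  # prints 421.75 Mb
```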

[Plot: x-axis, Depth (0 to 50); y-axis, Number of 17-mers (0 to 30 x 10^6)]

Supplementary Figure 2. Distribution of unassembled reads on watermelon chromosomes.

The color scale bar represents densities of the corresponding elements. TEs: transposable

elements; Unassembled: unassembled reads.

Supplementary Figure 3. Genome coverage evaluated by four fully sequenced BACs

(GenBank accession numbers: JN402338, JN402339, JX027061 and JX027062).

Supplementary Figure 4. Effects of sequence depth and large-insert reads on watermelon

genome assembly. (a) Scaffold N50 size and (b) total assembled genome size patterns of

assemblies with reads representing different sequence depths. (c) Scaffold N50 size and (d)

total assembled genome size patterns of assemblies with reads of different insert sizes (see

Supplementary Note).

[Plots: (a) scaffold N50 length (Kb) versus data depth (X); (b) total length (Mb) versus data depth (X); (c) scaffold N50 length (Kb) versus rank; (d) total length (Mb) versus rank]

Supplementary Figure 5. Distribution of divergence rate for each type of TEs in the

watermelon genome. The divergence rate was calculated between the identified TE elements in

the genome and the consensus sequence in the TE library.

[Plot: x-axis, sequence divergence rate (%), 0 to 30; y-axis, percentage of genome (%), 0 to 1; series: LINE, DNA, LTR, SINE]

Supplementary Figure 6. Distribution of TE insertion time of watermelon and cucumber.

MYA: million years ago.

Supplementary Figure 7. Heat map of the watermelon genome component distribution in the

eleven chromosomes. RTs, retrotransposons; LTR_RT, long terminal repeat retrotransposon;

DNA-TEs, DNA transposons.

Supplementary Figure 8. rDNA pattern in watermelon genomes. Fluorescence in situ

hybridization (FISH) analyses were performed using 45S and 5S rDNAs as probes on genomes

of C. lanatus subsp. vulgaris (a), C. lanatus subsp. mucosospermus (b) and C. lanatus subsp.

lanatus (c). Chromosomes, 45S rDNAs and 5S rDNAs are shown in blue, green and red,

respectively. Illustrations of rDNA distributions on the 11 watermelon chromosomes are

provided for C. lanatus subsp. vulgaris and C. lanatus subsp. mucosospermus (d) and C.

lanatus subsp. lanatus (e). Green and red dots represent 45S and 5S rDNAs, respectively.

Supplementary Figure 9. Time inference of the watermelon whole genome duplication

(WGD) event and the divergence time estimation of the watermelon/cucumber speciation.

[Panel labels: 97103, JX-2, JLM, JXF, RZ-901, XHBFGM, Black Diamond, Calhoun Gray, Sugarlee, Sy-904304, RZ-900, PI482271, PI500301, PI189317, PI595203, PI249010, PI248178, PI482276, PI482303, PI296341-FR, PI482326]

Supplementary Figure 10. Fruits of watermelon accessions used for genome sequencing and

resequencing.

Supplementary Figure 11. Distribution of disease resistance genes on watermelon

chromosomes.

Supplementary Figure 12. Venn diagram illustrating the extent to which the

watermelon, cucumber and pumpkin phloem transcriptomes contain common and

unique gene sets. The three cucurbit phloem sap transcriptomes were analyzed by

BLAST using an E-value cutoff of 1e-10 (see Supplementary Table 27). Data are

presented as percentages of the total number of transcripts for each cucurbit species.

The small number of unique pumpkin phloem transcripts (12.5%) reflects the absence

of a draft genome for this species; this compromised the identification of the full

phloem gene set in pumpkin.

Supplementary Figure 13. Model of sugar delivery to and metabolism within cells of

the watermelon fruit. Arrows indicate flux directions. Yellow block: No difference in

expression. For differentially expressed genes, the lowest and highest levels of

expression are represented by blue and red blocks, respectively. A green box indicates

higher expression in flesh than in rind, whereas a dark blue box indicates lower

expression in flesh than in rind. SE-CCC: sieve element companion cell complex; AGA:

α-galactosidase; GALK: galactokinase; UGGP: UDP-galactose/glucose

pyrophosphorylase; UGE: UDP-glucose 4-epimerase; UGP: UDP-glucose

pyrophosphorylase; PGM: phosphoglucomutase; HK: hexokinase; NI: neutral

invertase; IAI: insoluble acid invertase; SAI: soluble acid invertase; SUS: sucrose

synthase; FRK: fructokinase; PGI: phosphoglucoisomerase; SPS: sucrose phosphate

synthase; SPP: sucrose phosphate phosphatase; OPPP: oxidative pentose phosphate

pathway.

Supplementary Figure 14. Watermelon SC-uORF containing bZIP genes. (a) Phylogenetic

relationship of Arabidopsis bZIP proteins and bZIP proteins from other plant species

containing SC-uORF including four from watermelon. Accession or locus numbers: AtbZIP1

(At5g49450), AtbZIP2 (At2g18160), AtbZIP11 (At4g34590), AtbZIP44 (At1g75390),

AtbZIP53 (At3g62420), Am910 (Y13675), Am911 (Y13676), BZI-2 (AY045570), BZI-4

(AY045572), LIP19 (X57325), mLIP15 (D26563), OBF1 (X62745), rdLIP (AB015187),

TBZ17 (D63951), TBZF (AB032478), and OsOBF1 (AB185280). The subfamily of SC-uORF-containing
genes is indicated by a dotted square. The four watermelon genes are highlighted,
with the one differentially expressed during fruit flesh development shown in red. (b)
Alignment of the SC-uORF of bZIP proteins.

Supplementary Figure 15. MADS-box proteins from watermelon. Phylogenetic tree of

MADS-box family proteins of watermelon (pink dots), cucumber (green dots), and Arabidopsis

(yellow dots). Tomato LeMADS-RIN and TAGL1, and strawberry FaMADS-RIN (red dots)

were also included in the tree.

Supplementary Figure 16. Citrulline content in watermelon fruit flesh and rind. Data

represent the mean ± SE of two biological replicates. DAP: days after pollination.

[Plot: x-axis, 10, 18, 26 and 34 DAP; y-axis, Citrulline content (mg g-1 FW), 0 to 4.5; series: flesh, rind]

Supplementary Figure 17. Citrulline metabolic pathway in watermelon. Expanded gene

families in watermelon compared to Arabidopsis are highlighted in yellow while genes

differentially expressed during watermelon fruit development are highlighted in green.
