Transcriptome sequencing and de novo characterization of ... · Chandrasekharpur, Bhubaneswar,...

16
1 3 Mol Genet Genomics DOI 10.1007/s00438-016-1233-9 ORIGINAL ARTICLE Transcriptome sequencing and de novo characterization of Korean endemic land snail, Koreanohadra kurodana for functional transcripts and SSR markers Se Won Kang 1 · Bharat Bhusan Patnaik 1,2 · Hee‑Ju Hwang 1 · So Young Park 1 · Jong Min Chung 1 · Dae Kwon Song 1 · Hongray Howrelia Patnaik 1 · Jae Bong Lee 3 · Changmu Kim 4 · Soonok Kim 4 · Hong Seog Park 5 · Yeon Soo Han 6 · Jun Sang Lee 7 · Yong Seok Lee 1 Received: 10 February 2016 / Accepted: 25 July 2016 © Springer-Verlag Berlin Heidelberg 2016 the public databases. The direction of the unigenes to func- tional categories was determined using COG, GO, KEGG, and InterProScan protein domain search. The GO analysis search resulted in 22,967 unigenes (12.02 %) being cate- gorized into 40 functional groups. The KEGG annotation revealed that metabolism pathway genes were enriched. The most prominent protein motifs include the zinc finger, ribonuclease H, reverse transcriptase, and ankyrin repeat domains. The simple sequence repeats (SSRs) identified from >1 kb length of unigenes show a dominancy of dinu- cleotide repeat motifs followed with tri- and tetranucleotide motifs. A number of unigenes were putatively assessed to belong to adaptation and defense mechanisms including heat shock proteins 70, Toll-like receptor 4, AMP-activated protein kinase, aquaporin-2, etc. Our data provide a rich source for the identification and functional characterization of new genes and candidate polymorphic SSR markers in K. kurodana. The availability of transcriptome informa- tion (http://bioinfo.sch.ac.kr/submission/) would promote the utilization of the resources for phylogenetics study and genetic diversity assessment. Abstract The Korean endemic land snail Koreanohadra kurodana (Gastropoda: Bradybaenidae) found in humid areas of broadleaf forests and shrubs have been consid- ered vulnerable as the number of individuals are declining in recent years. The species is poorly characterized at the genomic level that limits the understanding of functions at the molecular and genetics level. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset of visceral mass tissue of K. kurodana by the Illumina paired-end sequencing tech- nology. Over 234 million quality reads were assembled to a total of 315,924 contigs and 191,071 unigenes, with an average and N50 length of 585.6 and 715 bp and 678 and 927 bp, respectively. Overall, 36.32 % of the unigenes found matches to known protein/nucleotide sequences in Communicated by S. Hohmann. S. W. Kang and B. B. Patnaik contributed equally to this work. Electronic supplementary material The online version of this article (doi:10.1007/s00438-016-1233-9) contains supplementary material, which is available to authorized users. * Yong Seok Lee [email protected] 1 Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungcheongnam-do 31538, Korea 2 Trident School of Biotech Sciences, Trident Academy of Creative Technology (TACT), Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha 751024, India 3 Korea Zoonosis Research Institute (KOZRI), Chonbuk National University, 820-120 Hana-ro, Iksan, Jeollabuk-do 54528, Korea 4 National Institute of Biological Resources, 42, Hwangyeong-ro, Seo-gu, Incheon 22689, Korea 5 Research Institute, GnC BIO Co., LTD., 621-6 Banseok-dong, Yuseong-gu, Daejeon 34069, Korea 6 College of Agriculture and Life Science, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju 61186, Korea 7 Institute of Environmental Research, Kangwon National University, 1 Kangwondaehak-gil, Chuncheon-si, Gangwon-do 243341, Korea

Transcript of Transcriptome sequencing and de novo characterization of ... · Chandrasekharpur, Bhubaneswar,...

1 3

Mol Genet GenomicsDOI 10.1007/s00438-016-1233-9

ORIGINAL ARTICLE

Transcriptome sequencing and de novo characterization of Korean endemic land snail, Koreanohadra kurodana for functional transcripts and SSR markers

Se Won Kang1 · Bharat Bhusan Patnaik1,2 · Hee‑Ju Hwang1 · So Young Park1 · Jong Min Chung1 · Dae Kwon Song1 · Hongray Howrelia Patnaik1 · Jae Bong Lee3 · Changmu Kim4 · Soonok Kim4 · Hong Seog Park5 · Yeon Soo Han6 · Jun Sang Lee7 · Yong Seok Lee1

Received: 10 February 2016 / Accepted: 25 July 2016 © Springer-Verlag Berlin Heidelberg 2016

the public databases. The direction of the unigenes to func-tional categories was determined using COG, GO, KEGG, and InterProScan protein domain search. The GO analysis search resulted in 22,967 unigenes (12.02 %) being cate-gorized into 40 functional groups. The KEGG annotation revealed that metabolism pathway genes were enriched. The most prominent protein motifs include the zinc finger, ribonuclease H, reverse transcriptase, and ankyrin repeat domains. The simple sequence repeats (SSRs) identified from >1 kb length of unigenes show a dominancy of dinu-cleotide repeat motifs followed with tri- and tetranucleotide motifs. A number of unigenes were putatively assessed to belong to adaptation and defense mechanisms including heat shock proteins 70, Toll-like receptor 4, AMP-activated protein kinase, aquaporin-2, etc. Our data provide a rich source for the identification and functional characterization of new genes and candidate polymorphic SSR markers in K. kurodana. The availability of transcriptome informa-tion (http://bioinfo.sch.ac.kr/submission/) would promote the utilization of the resources for phylogenetics study and genetic diversity assessment.

Abstract The Korean endemic land snail Koreanohadra kurodana (Gastropoda: Bradybaenidae) found in humid areas of broadleaf forests and shrubs have been consid-ered vulnerable as the number of individuals are declining in recent years. The species is poorly characterized at the genomic level that limits the understanding of functions at the molecular and genetics level. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset of visceral mass tissue of K. kurodana by the Illumina paired-end sequencing tech-nology. Over 234 million quality reads were assembled to a total of 315,924 contigs and 191,071 unigenes, with an average and N50 length of 585.6 and 715 bp and 678 and 927 bp, respectively. Overall, 36.32 % of the unigenes found matches to known protein/nucleotide sequences in

Communicated by S. Hohmann.

S. W. Kang and B. B. Patnaik contributed equally to this work.

Electronic supplementary material The online version of this article (doi:10.1007/s00438-016-1233-9) contains supplementary material, which is available to authorized users.

* Yong Seok Lee [email protected]

1 Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungcheongnam-do 31538, Korea

2 Trident School of Biotech Sciences, Trident Academy of Creative Technology (TACT), Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha 751024, India

3 Korea Zoonosis Research Institute (KOZRI), Chonbuk National University, 820-120 Hana-ro, Iksan, Jeollabuk-do 54528, Korea

4 National Institute of Biological Resources, 42, Hwangyeong-ro, Seo-gu, Incheon 22689, Korea

5 Research Institute, GnC BIO Co., LTD., 621-6 Banseok-dong, Yuseong-gu, Daejeon 34069, Korea

6 College of Agriculture and Life Science, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju 61186, Korea

7 Institute of Environmental Research, Kangwon National University, 1 Kangwondaehak-gil, Chuncheon-si, Gangwon-do 243341, Korea

Mol Genet Genomics

1 3

Keywords Koreanohadra kurodana · Transcriptomics · De novo assembly · Functional annotation · Simple sequence repeats

Introduction

Gastropods represent one of the most successful classes of living molluscs having greatly reduced shells or a com-plete absence of shells. From an evolutionary standpoint it is perplexing, though on account of parallel development of such an adaptation trait, it relates to a strong advantage for colonization on land. The Gastropod class of molluscs has invaded the land with more than 62,000 described liv-ing species, comprising about 80 % of living molluscs. The land snails are the most numerous with close to 35,000 described species, mostly inhabiting the tropics. Still, many species remain undescribed mainly due to their minute sizes and under-exploration from extreme terrestrial eco-zones. The land snail family Bradybaenidae, identified on the basis of the presence of dart sac and dart-related organs, are a relatively younger taxon, to the closely related Cama-enidae family (Hirano et al. 2014). The bradybaenids are extremely specious with distribution in Northwest Amer-ica and Europe, Southeast and East Asia including Japan, Korea, Taiwan, Indonesia, Philippines, and China (Chen and Zhang 2004; Wu et al. 2007; Kuznik-Kowalska et al. 2013; Hirano et al. 2014; Huang et al. 2014). A study by Kwon et al. (1993) classified the Korean Bradybaenidae snails into 24 species. Among the land snail fauna of Jeju Island, species belonging to family Bradybaenidae are one of the largest, with eight reported species (Noseworthy et al. 2007). Under bradybaenid snails, Koreanohadra has been classified as a genus of its own by Min et al. (2004) and accorded synonyms of Fruticicola and Karaftohelix by Schileyko (2004) and Sysoev and Schileyko (2009), respec-tively. Koreanohadra kurodana is a bradybaenid land snail species endemic to Korea. The species is found at humid areas in broadleaf forests with shrubs and known from Gangwon-do and Gyeonggi-do provinces of Korea. Chro-mosome numbers and karyotype of K. kurodana reported by Lee and Kwon (1993) suggests that it possesses 11 pairs of metacentric chromosomes, 17 sub-metacentric and one telocentric chromosome. Such an analysis was referred while reporting the 29 chromosome pairs in another species from the same genus, Koreanohadra koreana (Park 2011).

Koreanohadra kurodana has been assessed as vulner-able (≤10 species occupied within <2000 km2) by Korean Red List of Threatened Species, 2014. A continual decline has been observed in the area of occupancy of the species due to predation by natural enemies, indiscriminate collec-tion for ornamental use, change and loss of natural habi-tats caused by forest development. The species has been

threatened for extinction with lack of regional conservation measures. With genomic information lacking for the spe-cies, an evaluation of population genetic structure, genetic diversity, and possible evolutionary responses to environ-mental perturbations and climate change are not feasi-ble. The National Centre for Biotechnology Information (NCBI) database only shows the sequence of mitochondrial protein cytochrome c oxidase I (CO-1) under the synonym name Fruticicola koreana.

Next-generation sequencing (NGS) such as Illumina sequencing by synthesis (SBS) allows for massive char-acterization of the transcriptome by generation of mil-lions of short reads. The short reads produced provide a more sensitive application for the detection of differential expression compared to conventional gene arrays (Marioni et al. 2008; Collins et al. 2008). The Illumina platform has become increasingly cost-effective and been successfully used to characterize the transcriptome of non-model spe-cies with greater precision (Wang et al. 2010, 2012; Meng et al. 2013; Yue et al. 2015; Patnaik et al. 2015; Hwang et al. 2016). The Illumina NGS technology has also greatly benefitted with the development of algorithms and experi-ments for de novo assembly (Martin and Wang 2011; Ekb-lom and Wolf 2014). The Illumina sequencing technology has been utilized for the transcriptome characterization of molluscs and annotation of genes suggested to participate in various metabolic, cellular and immune pathways (Cas-tellanos-Martinez et al. 2014; Deng et al. 2014; Patnaik et al. 2016a). Furthermore, molecular markers such as sin-gle nucleotide polymorphism (SNP) and simple sequence repeats (SSR) have been detected from de novo assembled sequences of Illumina-based transcriptome of molluscs (Franchini et al. 2011; Patnaik et al. 2016b). In this study, we have used Illumina HiSeq 2500 sequencing to gener-ate a transcriptome of K. kurodana. We have conducted de novo assembly of the transcriptome and used bioinfor-matics analysis to annotate the functional direction of the assembled sequences. In doing so, we have suggested the predicted functions based on database classification struc-ture. This is the first large-scale transcriptome information for K. kurodana that could be useful in providing valuable resources towards functional gene discovery and devel-opment of molecular markers for the conservation of the species.

Materials and methods

Species collection and RNA extraction

Specimens of K. kurodana were collected from the moun-tain near to Maha-ri, Mitan-myeon, Pyeongchang-gun, Gangwon-do, Republic of Korea in August 2014. No

Mol Genet Genomics

1 3

prior permission and ethical clearance was required as the species is endemic to Korea and enlisted under ‘vulner-able’ category of Korean Red List KORED assessments. As per the regulations of National Institute of Biological Resources (NIBR), Korea, only the collection of ‘endan-gered’ species class I or class II needs a permission certifi-cate. The collected specimens were housed in the labora-tory within purpose-built enclosures at room temperature, and provided with water and carrot diet ad libitum. Nec-essary care was taken during the period of laboratory acclimatization and dissection of visceral mass tissue fol-lowed the International Guiding Principles of Biomedi-cal Research Involving Animals (1985) (http://www.ncbi.nlm.nih.gov/books/NBK25438/). Visceral mass tissue was dissected from three living snails, and was stored in RNA-later stabilization reagent (Qiagen) that preserves the integ-rity of RNA in tissue samples. Total RNA was extracted with TRIzol® reagent (Invitrogen, Carlsbad, CA, USA) using manufacturer protocols. Residual DNA (if any) was removed using RNase-free DNase I (Qiagen) by incubation at 37 °C for 30 min. Following extraction, the RNA quality was assessed on a 1.2 % formaldehyde agarose gel electro-phoresis and quantified using NanoDrop Spectrophotom-eter (Thermo Scientific, Wilmington, DE) and Agilent 2100 Bioanalyzer (Agilent Technologies).

Illumina HiSeq 2500 sequencing and de novo transcriptome analysis

Paired-end cDNA library was prepared using the mRNA-seq sample preparation kit. For the same, poly (A) mRNA was isolated using oligo (dT) beads and fragmented using the fragmentation buffer. The cleaved mRNA was used to synthesize first strand cDNA using random-hexamer prim-ers and reverse transcriptase (Invitrogen). RNase H (Inv-itrogen) and DNA polymerase I (New England BioLabs, Ipswich, MA, USA) were used to prime the second-strand cDNA. The double-stranded cDNA, thus formed was end-repaired using T4 DNA polymerase, the Klenow fragment, and T4 polynucleotide kinase (New England BioLabs). Illumina paired-end adapters were ligated to the end-repaired cDNA and were sequenced at GnC Bio Company, Daejeon, Korea, using an Illumina HiSeq 2500 sequencing platform (2 × 100 bp paired-ends). The sequencing reads were pre-processed for adapter removal using Cutadapt program (http://code.google.com/p/cutadapt/) with default parameters. After filtering the adapter-only sequences, low-quality reads (≥20) were removed using Phred (http://www.phrap.com/). GC content bias in the first 13 bases of Illumina reads, possibly introduced through random-hex-amer priming were also removed to get high quality reads.

The reads, thus, obtained were assembled using Trinity software with default options (minimum allowed length of 200 bp) (Grabherr et al. 2011). A K-mer length of 25 was chosen that seems appropriate as smaller K-mers lead to complex de Bruijin graphs. Trinity de novo program used extensively as an assembly algorithm compiles the sequence data into a number of de Bruijin graphs that ulti-mately reports transcripts in its final form. The TIGR Gene Indices Clustering Tool (TGICL) (Pertea et al. 2003) was used to assemble unigene sequences (without Ns that could not be extended on either end).

Annotation of unigene sequences

The non-redundant unigenes were annotated to public pro-tein sequence repositories using BLASTX at an E value cutoff of 1E−5. The databases for annotation include Pro-tostome (PANM-DB), Clusters of Orthologous Groups (COG) (http://www.ncbi.nlm.nih.gov/COG/), Swiss-Prot, Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2010). The Uni-gene DB resource was also utilized for the annotation of assembled sequences using BLASTN. PANM-DB regis-ters the protein sequence information of species belong-ing to Mollusca, Arthropoda, and Nematoda phylum in a multi-FASTA format. PANM-DB is readily accessible in a web-based interface of the Malacological Society of Korea (http://malacol.or.kr/blast/aminoacid.html) (Kang et al. 2015). The Venn diagram plot constructed using Venny 2.0 (http://bioinfogp.cnb.csic.es/tools/venny/) was used for the representation of sequence similarity using PANM, Uni-gene, and COG DB. The BLAST2GO software (Ashburner et al. 2000; Conesa et al. 2005) was utilized under default parameters for InterProScan protein domain analysis, GO functional and KEGG pathway mapping.

Discovery of SSR markers

We screened unigenes of length >1000 bp for SSR type markers using the MISA (MicroSAtellite; http://pgrc.ipk-gatersleben.de/misa, version 1.0) software. We selected for di-, tri-, tetra-, penta-, and hexanucleotide repeats with six, five, four, four, and four minimum iterations, respectively. Other higher repeat motifs were not considered due to their restricted occurrence within the selection criteria. Mononu-cleotides were not considered due to frequent homopolymer generation in Illumina sequencing platform. Subsequently, we designed a set of PCR primers using the BatchPrimer 3 program (You et al. 2008) that flanks the SSR microsatel-lites. Considering the utilization of the same in polymor-phism studies we used the following criteria for primer

Mol Genet Genomics

1 3

identification: dinucleotides with ten or more iterations and tri-/tetranucleotide repeats with seven or more iterations.

Results

High throughput sequencing of the visceral mass transcriptome

The Illumina sequencing of visceral mass samples from the bradybaenid snail, K. kurodana, generated 238,755,412 paired-end reads with 30,083.18 Mb of sequence data. The average length of raw reads were 126 bp with a GC% of 47.07. The raw read sequence data in FASTQ format was deposited in the National Centre for Biotechnology Infor-mation (NCBI) Sequence Read Archive (SRA) database under the accession number SRP065002. Further, the assembled contigs can be downloaded from the server on or after 16-10-2016 (1 year after the initial date of submission).

The pre-processing of the reads included adapter trim-ming wherein 30,023.45 Mb were filtered with 99.80 % read efficiency. The average read length after trimming was 125.8. The adapter trimming statistics is summarized in S1 Table. Further processing, qualified 234,382,719 reads with 28,988.89 Mb as clean reads. This suggests that 1.83 % of sequences did not pass the quality control check and hence were eliminated before the assembly process. The clean reads showed a mean and N50 length of 123.7 and 126 bp, respectively, with a GC% of 47.06.

De novo assembly of the visceral mass transcriptome

The Trinity de novo program was used to assemble the clean reads to contiguous overlapping sequences defined as contigs. A total of 315,924 contigs were obtained com-prising of 185 Mb of sequences. The mean length, N50 length (i.e., 50 % of total assembled sequences were hav-ing this length or longer contigs), and GC% of contigs were 585.6 bp, 715 bp, and 42.95 %, respectively. About

31.12 % of contigs had sizes of ≥500 bp with the larg-est contig of length 18,009 bp. These contigs were fur-ther assembled into 191,071 unigenes comprising of 129.55 Mb of sequences. While the length of unigenes ranged from 107 to 25,186 bp, the mean length, N50 length and GC% were 678 bp, 927 bp, and 42.84 %, respectively. The de novo assembly statistics for this study are given in Table 1. Considering the length distribution of the assem-bled contig sequences, almost 68.98 % of contigs were found shorter than 500 bp and only 11.81 % of contigs were longer than 1000 bp. Within the final assembled uni-genes, 61.14 % were of length ≤500 bp, 22.75 % were no more than 1000 bp, and 16.11 % were of length ≥1000 bp. The size distribution of assembled contigs and unigenes for this study is shown in Fig. 1.

Sequence similarity of assembled unigenes

The assembled unigenes were annotated against the public protein (PANM, COG, Swiss-Prot, GO, and KEGG data-bases) and nucleotide sequence database (Unigene) using BLASTX and BLASTN at an E value of ≤1E−5. Out of 191,071 assembled unigenes, 36.32 % of unigenes found a match to any one of the databases. A total of 62,758 uni-genes (32.85 %) found a match to PANM database, while 40,041 (20.96 %), 22,504 (11.78 %), 27,894 (14.6 %), 22,967 (12.02 %), and 2104 (1.1 %) unigenes showed homology to available sequences in Swiss-Prot, Unigene, COG, GO, and KEGG databases, respectively. A greater proportion of the annotated unigene sequences were having lengths of ≥1000 bp. In total, 33.3 % of all database anno-tated sequences were having lengths ≥1000 bp and 17.62 % of sequences had lengths in between 300 and 1000 bp. The sequence similarity of the assembled unigenes against public databases in this study is given in Table 2. A total of 15,868 unigene sequences show homology to protein entries in both PANM and COG database while 5124 and 2 unigenes show homology to sequences in both PANM and Unigene and Unigene and COG databases, respectively.

Table 1 Sequence assembly statistics after transcriptome sequencing using Illumina Next-Generation sequencing platform

Description Total number Total nucleotides (Mb) Mean length (bp) N50 length (bp) GC%

Raw reads 238,755,412 30,083.18 126 126 47.07

Clean reads 234,382,719 28,988.89 123.7 126 47.06

Contigs 315,924 185 585.6 715 42.95

Unigenes 191,071 129.55 678 927 42.84

Kmer 25

High-quality reads 98.17 % (sequences), 96.36 % (nucleotides)

Largest contig (bp) 18,009

No. of large contigs (≥500 bp) 98,300

Range of unigenes length (bp) 107–25,186

Mol Genet Genomics

1 3

A total of 12,018 unigene sequences showed matches to sequences in all three databases. The three-way Venn dia-gram depicting the unigene annotation to PANM, Unigene, and COG databases in this study is shown in Fig. 2. The top-hit species distribution of the unigenes against PANM database using BLASTX program is shown in Fig. 3. In total, 46.07 % of K. kurodana unigene sequences matched Aplysia californica protein sequences. This was signifi-cant as the other top-hit species accounted for a very small proportion of the unigene sequences. Matches against Lot-tia gigantea and Crassostrea gigas were limited to about 9.3 % and 7.11 %, respectively.

Continuing with the BLASTX annotation of the assem-bled unigenes of K. kurodana against PANM database at a cutoff of 1E−5, we obtained the homology match statistics as shown in Fig. 4. The E value distribution indicated that a major proportion of unigenes (67.11 %) had an E value between 1E−50 and 1E−5. Only 12.75 and 10.93 % of unigenes had E value of 1E−100 to 1E−50 and 0, respec-tively (Fig. 4a). Only 30.68 % of unigenes showed an iden-tity greater than 60 % to the matched database sequences

(Fig. 4b). The similarity distribution showed 65.24 % of unigenes to have a similarity of more than 80 % with the matched database sequences (Fig. 4c). The number of uni-genes hit to matched sequence in the database increased proportionately with the length of the unigenes. Sequences with length less than 500 bp showed a maximum unigene hit of approximately 25 %, while sequences of length above 2001 bp showed unigene hit of approximately 90 %. This suggests that a greater proportion of longer unigenes

Fig. 1 Size distribution of de novo assembled contigs (a) and unigenes (b) from the visceral mass transcriptome of K. kurodana

Table 2 Annotation hits of K. kurodana unigenes from transcriptome assembly to public databases

Databases All ≤300 bp 300–1000 bp ≥1000 bp

PANM-DB 62,758 10,735 29,897 22,126

Unigene 22,504 3596 10,123 8785

COG 27,894 2622 9964 15,308

GO 22,967 2863 9028 11,076

KEGG 2104 140 699 1,265

InterProScan 24,136 2520 9032 12,584

SwissProt 40,041 5240 16,708 18,093

ALL 69,393 12,361 34,064 22,968

Fig. 2 A Venn diagram showing the annotation profile of K. kurodana unigenes against PANM-DB (Protostomia database), Uni-gene DB, and COG DB

Mol Genet Genomics

1 3

Fig. 3 Top-hit species distribution of K. kurodana unigenes against PANM-DB using BLASTX

Fig. 4 Homology search of K. kurodana unigenes using BLASTX against PANM-DB. a BLAST E value distribution analysis, b BLAST iden-tity distribution, c BLAST similarity distribution, and d unigene hit and non-hit ratio

Mol Genet Genomics

1 3

would find matches to a homologous sequence in the data-base. The hit and non-hit ratio of K. kurodana unigenes to PANM database is shown in Fig. 4d.

Functional classification of assembled unigenes

The assembled unigenes of K. kurodana were ascribed pre-dicted functions based on COG classifications, GO terms, and KEGG orthology mappings. Furthermore, the unigenes were searched for conserved protein domains using the InterPro protein domain search. COG database includes orthologous proteins and classifies the gene products into broad functional categories including a class of general function prediction and uncharacterized COGs (Tatusov et al. 2001). A phylogenetic classification of the coding sequences of assembled unigenes within COG database for K. kurodana unigenes is shown in Fig. 5. We find that the assembled unigenes were classified into 25 functional COG categories. The ‘multi’ category represents those unigene sequences that belong to more than one of the functional COG categories. Among the representative COG catego-ries, ‘general function prediction’ and ‘signal transduction mechanisms’ were prominent with 21.1 and 10.9 % of uni-genes classified, respectively. Following the larger COG categories, the unigenes were also predicted to classify

under ‘function unknown (6.8 %)’, ‘posttranslational modi-fication, protein turnover and chaperones (5.5 %)’, and ‘transcription (4.2 %)’ functional groups. The COGs with lowest number of unigenes were ‘defense mechanisms (0.8 %)’, ‘chromatin structure and dynamics (0.6 %)’, ‘coenzyme transport and metabolism (0.5 %)’, and ‘cell motility (0.2 %)’.

The assembled unigenes from this study were subjected to GO analysis that identifies the transcribed region and suggests the grouping of a gene to known (or predicted) function under biological process, cellular component, and molecular functions. This is an international standard-ized gene functional classification that only suggests the putative products in an organism. Out of the 22,967 uni-genes annotated to GO database, 39.61 % unigenes were specifically classified under molecular function, 3.61 % under biological process, and 3.13 % under cellular com-ponent category. Many unigenes classified under more than one GO functional category showing the overlap-ping structure in the Venn diagram (Fig. 6a). A total of 28.95 % unigenes were grouped to biological process and molecular function, 3.65 % to biological process and cel-lular component, and 2.69 % to cellular component and molecular function. About 18.35 % of unigenes got clas-sified under all the three GO functional categories. A total

Fig. 5 COG classification of K. kurodana unigenes to functional categories

Mol Genet Genomics

1 3

of 7026 assembled unigene sequences annotated against GO database were suggested to classify with a single GO term. The remaining annotated sequences got ascribed to more than one GO term. The classification of GO anno-tated unigene sequences to number of GO terms has been shown in Fig. 6b. The annotations of unigenes under GO terms at level 2 for this study have been shown in Fig. 7. The most predominant GO terms under biological process category were cellular process (GO: 0009987), metabolic process (GO: 0008152), and single-organism process (GO: 0044699). Many unigenes were also predicted to function under response to stimulus (GO: 0050896) and signaling (GO: 0023052) (Fig. 7a). Under the cellular component category, the most prominent GO terms included mem-brane (GO: 0016020), cell junction (GO: 0005623), orga-nelle (GO: 0043226), and macromolecular complex (GO: 0032991) (Fig. 7b). In the category of molecular function, the term binding (GO: 0005488) showed the highest pro-portion of annotations, followed by catalytic activity (GO: 0003824) (Fig. 7c).

The KEGG pathway database supports the study of the biological functions of genes by elucidating the regula-tory pathway specific to particular organisms (Altermann and Klaenhammer 2005). The mapping of identified uni-genes of K. kurodana in the KEGG database revealed 4725 sequences with known enzymes (661) assigned to intracel-lular metabolic pathways, organismal systems, environ-mental information processing, and genetic information processing pathways. Within the predominant metabolism group, the unigenes sequences were predicted to func-tion largely under nucleotide metabolism (1516 unigenes with 9 enzymes) followed with metabolism of cofactors

and vitamins (964 unigenes with 32 enzymes), carbohy-drate metabolism (402 unigenes with 48 enzymes), and xenobiotics biodegradation and metabolism (356 unigenes with 165 enzymes). All the unigene sequences predicted to genetic information processing, environmental infor-mation processing and organismal systems were found to fall under translation (aminoacyl-tRNA biosynthesis), sig-nal transduction (mTOR and phosphatidylinositol signal-ing system), and immune system (T-cell receptor signal-ing pathway), respectively. The KEGG pathway analysis obtained in this study is shown in Fig. 8. Furthermore, as shown in S2 Table, most unigenes under metabolism were classified to purine metabolism, most under metabolism of cofactors and vitamins to thiamine metabolism, and most under xenobiotics biodegradation and metabolism to aminobenzoate degradation. We made an attempt to iden-tify unigenes suggested to participate in the adaptation and defense mechanisms of the snail species (Table 3). Among the candidate gene families, AMP-activated protein kinase (AMPK), heat shock proteins 70, insulin receptor sub-strate-1, Toll-like receptor 4 (TLR4), and aquaporin-2 were represented by the unigenes.

The BLAST2GO annotation, directly performed on the unigene sequences, was useful in the identification of Inter-Pro conserved domains in 12.63 % of the sequences. The 25 top-hit InterPro domains has been listed in Table 4. The IPR015880 (Zinc finger, C2H2-like domain) was the most represented (1739 unigene sequences), followed up with IPR012337 (Ribonuclease H-like domain) and IPR000477 (reverse transcriptase domain). Other notable top-hit Inter-Pro domains identified in this study includes IPR002110 (ankyrin repeat), IPR001680 (WD40 repeat), IPR013783

Fig. 6 Annotation profile of K. kurodana unigenes against the GO database. a Venn diagram showing the predicted classification of uni-genes to GO categories of biological process, cellular component and

molecular function. b An analysis of GO annotated unigenes based on the number of GO terms

Mol Genet Genomics

1 3

(Immunoglobulin-like fold domain), IPR000504 (RNA rec-ognition motif domain), IPR001304 (C-type lectin domain) and IPR002126 (cadherin domain).

SSR detection in the assembled unigenes

For SSR identification, we examined 30,827 assem-bled unigene sequences (≥1000 bp) with a total base size of 61,674,216 bp. We identified 6924 SSRs in 4939 sequences, with 1290 sequences containing more than one SSR. Additionally, we found 1269 SSRs in the compound form. Among the identified SSRs, dinucleotide (3406, 49.19 %) repeats were the predominant followed by tri-nucleotide (2628, 37.95) and tetranucleotide (11.57 %) repeats. While we selected for a minimum of six repeat numbers for dinucleotide repeats, we screened for trinu-cleotide repeats and other nucleotide repeats (tetra-, penta-, and hexanucleotides) with a minimum repeat number of five and four, respectively. As shown in Fig. 9a, the pre-dominant repeat numbers were six, five, and four for di-, tri-, and all other nucleotide repeats, respectively. Consider-ing the repeat motif types identified in this study, AC/GT

(1764, 25.48 %), AG/CT (17.35 %), and AT/AT (5.68 %) formed the predominant dinucleotide motif types. ATC/ATG (1007, 14.54 %) and ACAG/CTGT (1.98 %) formed the predominant trinucleotide and tetranucleotide motif type, respectively. The frequency of repeat motif types obtained in this study is shown in Fig. 9b. For future stud-ies on the polymorphic assessment of the species diver-sity, we have specified the primer sequences flanking the di- (minimum 10 repeat numbers), tri-, and tetranucleotides (minimum 7 repeat numbers) (S3 Table). Many of the uni-genes with SSR primer sequences suggests putative func-tional status such as T-cell receptor, TRAF3 interacting protein, RAB2, beclin 1-like isoform, acetylcholine recep-tor subunit, and fibrinogen-related protein 1 that are essen-tial components of innate immunity system and defense mechanisms.

Discussion

Next Generation Sequencing technologies are in the fore-front of revolutionizing the field of genetics and genomics.

Fig. 7 GO term classification. The annotated unigenes against GO database were ascribed to GO terms under biological process (a), cellular component (b), and molecular function (c) category

Mol Genet Genomics

1 3

The mRNA sequencing of non-model invertebrates have been gaining popularity among researchers as it promises to establish the species at the interface of ecology and evo-lution. The transcriptomes of economically important, rare and endangered species of molluscs have been investigated in separate studies using Next Generation platforms such as 454 and Illumina (Werner et al. 2012; Clark and Thorne 2015). In this study, we targeted the transcriptome of K. kurodana, a vulnerable gastropod species endemic to South Korea. It was hypothesized that, in order to conserve the species, a phenotypic screen of favorable traits are neces-sary that could manifest towards manipulation of adaptive features. The visceral mass transcriptome clarified a total of 234,382,719 high quality reads assembling to overlapping contiguous sequences. The average and N50 contig length thus obtained were greater in this study than that obtained with Illumina sequencing and de novo contig assembly of molluscs such as Radix balthica (Feldmeyer et al. 2011), Haliotis midae (Franchini et al. 2011), Chlamys farreri (Shi et al. 2013), Echinolittorina malaccana (Wang et al. 2014), and Anadara trapezia (Prentis and Pavasovic 2014). Con-sidering the evaluation of de novo transcriptome assem-blies wherein reference sequences are unavailable, median contig length, N50 contig length and number of contigs are common indicators of quality, although higher total contig lengths may also suggest a greater redundancy of

sequences (Ren et al. 2012; Zhang et al. 2014). Similarly, N50 is suggested to provide a measure of continuity but not accuracy of contigs (Salzberg et al. 2012). Some reports have even suggested the coverage of mitochondrial genes as a parameter for the completeness of assembly process (Coppe et al. 2010; Zhang et al. 2014; Patnaik et al. 2016a), but this study is impossible in organisms without the mito-chondrial transcriptome. Hence, to our understanding no general criterion satisfies the quality of de novo transcrip-tome assembly.

The derived information on the unigene sequences, obtained after clustering of the contig assembly were con-sidered for the discovery of known genes. The query uni-gene sequences were matched with the subject homolo-gous sequences from the public databases. The maximum matches of unigenes to PANM database in the present study is justified as it is a refined protostome group data-base used to improve the efficiency of the annotation in a reduced time when compared with NCBI non-redundant database (Kang et al. 2015). About 64.05 % of unigenes remain unannotated and we assume that these would be short sequences with the absence of a conserved protein domain. It is also expected that the unannotated sequences may be non-coding genes, untranslated regions or just ran-dom transcriptional noise. The unigenes showing greater match to A. californica protein sequences is possibly due

Fig. 8 Kyoto Encyclopedia of Genes and Genomes (KEGG) path-way analysis. The KEGG annotated sequences to different ontolo-gies are shown (outer circle). The total number of enzyme sequences

within each pathway is shown in the inner circle. Each pathway is represented in a different color

Mol Genet Genomics

1 3

to extensive genomic information for the species (Broad Institute, A. californica genome project). Additionally, the neuronal transcriptome (Moroz et al. 2006), developmen-tal transcriptome (Heyland et al. 2011), and early life his-tory stages transcriptome (Fiedler et al. 2010) of A. califor-nica also have been reported suggesting its importance as a model system for cellular and molecular studies involving learning and behavior (Leonard and Edstrom 2004; Choi et al. 2010). Similarly, genomic resources for the freshwa-ter snail, L. gigantea (Simakov et al. 2013) and the oyster, C. gigas (Zhang et al. 2012) are abundant that could reflect a greater homology of K. kurodana unigenes. The unigene annotation results in this study seems appropriate espe-cially when related to the minor BLAST hits to mollusca sequences reported for other molluscan transcriptomes (Sadamoto et al. 2012). The BLAST annotation procedure involving databases has been considered as the best utiliz-able model in ascertaining the unigene direction in most non-model species (Rao et al. 2015; Chen et al. 2015; Pat-naik et al. 2015).

The COG classification of unigenes in this study are in consistent with studies in other non-model organisms

including molluscs (Lv et al. 2014; Patnaik et al. 2016a, b). GO analysis was conducted on annotated transcripts using BLAST2GO software. GO classification is not an evidence of functionality as all GO terms do not show equal validity (Rhee et al. 2008; Rogers and Ben-Hur 2009). Here, con-sidering the GO terms, we emphasize the point that each GO term has an associated evidence code. In this study, as in many other species, the majority of terms are assigned the code ‘IEA’ (inferred from electronic annotation); these are terms that have no experimental verification and hence would be of questionable validity. Additionally, GO terms also contain low quality terms with evidence codes such as ‘NAS’ (non-traceable author statement) or ‘ND’ (no bio-logical data available). Hence, the GO interpretation of uni-genes in this study is only predicted.

We also identified the unigenes responsible for a regu-latory function in defense mechanism and adaptation. AMPK is called the “fuel gauge” of the cell as it regu-lates the catabolic ATP-generating and anabolic ATP con-suming processes. AMPK helps in the adaptation of land snails as it responds to hypoxia, glucose deprivation, various cytokines, elevated intracellular Ca2+ and tumor

Table 3 List of adaptation-related unigenes identified from K. kurodana transcriptome

Candidate genes family Number of unigene

Full name Symbol Also known as

Angiotensin I converting enzyme ACE DCP; ICH; ACE1; DCP1; CD143; MVCD3 545

Adenylate cyclase activating polypeptide 1 ADCYAP1 PACAP 376

Angiotensinogen AGT ANHU; SERPINA8 219

AMP-activated protein kinase AMPK AK; AmpKalpha; CG3051; dAMPK; Gprk4; SNF1A 6026

Aquaporin 2 AQP2 AQP-CD; WCH-CD 2481

Basic helix–loop–helix family, member e40 BHLHE40 DEC1; HLHB2; BHLHB2; STRA13; Stra14; SHARP-2

36

Basic helix–loop–helix family, member e41 BHLHE41 DEC2; hDEC2; BHLHB3; SHARP1 80

Chromosome 9 open reading frame 3 C9ORF3 APO; AP-O; AOPEP; ONPEP; C90RF3 5

Collagen type I alpha 1 COL1A1 OI1; OI2; OI3; OI4; EDSC 847

Deiodinase, iodothyronine DIO 5DI; TXDI1 199

Endothelial PAS domain protein 1 EPAS1 HLF; MOP2; ECYT4; HIF2A; PASD2; bHLHe73 168

Glutamate ionotropic receptor delta type subunit 1 GRID1 GluD1; GluRdelta1 142

Glutamate ionotropic receptor NMDA type subunit 2B

GRIN2B MRD6; NR2B; hNR3; EIEE27; GluN2B; NMDAR2B

470

Heat shock proteins 70 HSP70 HSPA6; HSP70B 4160

Insulin receptor substrate 1 IRS1 G972R; IRS-1 3066

Mitogen-activated protein kinase kinase kinase 15 MAP3K15 ASK3; bA723P2.3 51

Phospholipase A2 group XIIA PLA2G12A GXII; ROSSY; PLA2G12 4

Regulator of cell cycle RGCC RGC32; RGC-32; C13orf15; bA157L14.2 92

Somatolactin SL 74

Solute carrier family 14 member 2 SLC14A2 UT2; UTA; UTR; HUT2; UT-A2; hUT-A6 1

Stimulated by retinoic acid 6 STRA6 MCOPS9; MCOPCB8; PP14296 585

T-box 5 TBX5 HOS 903

Toll-like receptors4 TLR4 TIL4; CD282 2907

Mol Genet Genomics

1 3

suppressors (Hardie 2007; Storey and Storey 2010). Heat shock protein 70 family of proteins are known stress mark-ers and have been implicated in the defense mechanisms and adaptation of the land snail species (Wang et al. 2013). TLRs with high identity to TLR4 have also been identi-fied in the transcriptome of freshwater snail, Oncomelania hupensis, responsible for Toll signaling process (Zhao et al.

2015). The discovery of unigenes predicted to biochemical pathways related to metabolism, immunity and defense will be useful in advancing the understanding of K. kurodana functions under biotic and abiotic stressors.

With the InterPro domain analysis, significant attrib-utes of the unigenes could be suggested. The most com-mon zinc finger domain proteins can interact with the

Table 4 Top-hit 25 InterProScan protein domains annotated within the unigene sequences of K. kurodana

Domain Description Unigene

IPR015880 Zinc finger, C2H2-like domain 1739

IPR012337 Ribonuclease H-like domain 432

IPR000477 Reverse transcriptase domain 396

IPR002110 Ankyrin repeat 364

IPR013087 Zinc finger C2H2-type/integrase DNA-binding domain 323

IPR027417 P-loop containing nucleoside triphosphate hydrolase domain 261

IPR002290 Serine/threonine/dual specificity protein kinase, catalytic domain 226

IPR003591 Leucine-rich repeat, typical subtype repeat 195

IPR001680 WD40 repeat 184

IPR005135 Endonuclease/exonuclease/phosphatase domain 183

IPR013783 Immunoglobulin-like fold domain 176

IPR000504 RNA recognition motif domain 173

IPR000276 G protein-coupled receptor, rhodopsin-like family 172

IPR011701 Major facilitator superfamily 156

IPR001304 C-type lectin domain 153

IPR002126 Cadherin domain 145

IPR002048 EF-hand domain 138

IPR000742 EGF-like domain 132

IPR011989 Armadillo-like helical domain 126

IPR001841 Zinc finger, RING-type domain 122

IPR012336 Thioredoxin-like fold domain 119

IPR001888 Transposase, type 1 family 107

IPR001478 PDZ domain 103

IPR001245 Serine-threonine/tyrosine-protein kinase catalytic domain 101

IPR000008 C2 domain 98

Fig. 9 SSR repeats and types identified from the assembled unigene sequences. a The frequency of SSR repeat types in relation to the repeat numbers. b The diverse repeat motif types and their frequency of occurrence in the unigene sequences

Mol Genet Genomics

1 3

nucleotides (DNA, RNA), other proteins and lipids and contribute to regulation of transcription, processing and stability of mRNA, and protein turnover (Ravasi et al. 2003). They are most often the top-hit abundant domains in transcriptomes of non-model organisms including mol-luscs (Firmino et al. 2013; Zhao et al. 2014; Patnaik et al. 2015, 2016b). The ribonuclease H-like domain is included with many such unique domain sequences under Ribonu-clease H-like (RNHL) superfamily involved to contribute in many biological processes including RNA interference (Majorek et al. 2014). The ankyrin repeat motif (33-resi-due) is also a frequently found domain in many proteins that serves protein–protein interactions (Mosavi et al. 2004). The functions of the proteins in which ankyrin repeats are populated includes regulation of transcrip-tion, embryogenesis (Notch protein in Drosophila), tumor suppressor, and transcription factor (Swi4 and Swi6 in yeasts) (Voronin and Kiseleva 2008). WD-40 repeats, also a motif contributing to protein–protein interactions and involved in cell death and transcriptional regulation has been found in the transcriptome of Crassostrea virginica (Zhang et al. 2014) and Cristaria plicata (Patnaik et al. 2016b). The immunoglobulin-like fold domain is the most abundant of immunoglobulin-related domains pivoting essential functions in vertebrate immune responses and has been reported from the transcriptomes of non-model species (Pallavicini et al. 2013; Zhao et al. 2014; Patnaik et al. 2015). The C-type lectin domain proteins are primar-ily carbohydrate-binding proteins called lectins, but there are many members with this domain those are not lectins. Basically, these proteins contribute to innate immunity in invertebrates including molluscs (Joubert et al. 2010; Zhang et al. 2014; Patnaik et al. 2016b). The cadherin domain has also been abundantly noticed in transcriptome analysis of molluscs (Zhang et al. 2012; McDowell et al. 2014; Patnaik et al. 2016b).

An important utility of high-throughput transcriptome characterization is to identify and validate microsatellite markers within the putative transcribed regions. Microsat-ellite identification including the development of SSRs and SNPs provide for a convenient platform to assess popula-tion diversity at the genetic level and serves to conserve the gene pool for rare and endangered species. The limi-tations to the development of efficient genetic markers in non-model organisms included the time and cost proto-cols as many of these were screened from whole-genome libraries. But lately, the cDNA based microsatellite iden-tification has accelerated the screening, development, and application of markers for the benefit of non-model organisms (Zalapa et al. 2012; Lopez-Uribe et al. 2012). Advances in NGS technologies, bioinformatics platforms and algorithms have helped for an efficient detection of

simple repeated sequences in the functional transcripts. The transcriptome-derived microsatellites are more ame-nable to studies related to conservation genomics as these correlate to known (or predicted) genes and show better transferability among different species (Kashi et al. 1997; Wang and Guo 2007). Lately, many transcriptome studies on non-model species characterized microsatellites and their direct application towards marker-assisted selection (MAS), growth and reproductive performance, and quanti-fication of genetic diversity within and among the popula-tions of rare and endangered species (Mikheyev et al. 2010; Zheng et al. 2013; Ma et al. 2014; Yue et al. 2015; Barat et al. 2016; Patnaik et al. 2016b). In this study, we screened SSR microsatellite sequences and analyzed the di- to hexa-nucleotide repeats both in terms of the number of repeats and repeat motif types. Furthermore, we recorded the puta-tive functions of the SSR derived unigenes and proposed a set of primer sequences flanking the SSR sites for polymor-phism identification in the species and studies on popula-tion diversity.

Our results were in contrast with the SSRs detected from the gonad transcriptome of Pinctada margaritifera, where tetranucleotide repeats formed more than half of the SSRs (Teaniniuraitemoana et al. 2014). In case of C. virginica, dinucleotide repeats were found more frequently after the abundant mononucleotide repeats (Zhang et al. 2014). We have ignored the screening of mononucleotide repeats in the present study due to the possibility of accessing homopolymers from Illumina sequencing errors. The most abundant dinucleotide repeat motif types in the oyster C. plicata were also found to be similar, but the predominant tri- (AAT/ATT) and tetranucleotide repeat (ACAT/ATGT) motif types were different (Patnaik et al. 2016b). The SSRs identified in this study will support the study of popula-tion structure, phylogenetics, and phylogeography of the species.

Here we report the first transcriptome survey of genetic resources from Koreanohadra kurodana, a non-model species endemic to Korea facing extinction. The 191,071 unigenes identified and annotated using bioinformatics resources enable functional classification to cellular, bio-logical and molecular processes. Many biochemical path-ways in the species were revealed due to the association of the transcripts including enzymes to them. In addition, identification of suitable SSRs provide viable targets for polymorphism studies across Koreanohadra kurodana pop-ulations and exploring genetic identity or diversity within related species.

Acknowledgments This work was supported by the grant entitled “The Genetic and Genomic Evaluation of Indigenous Biological Resources” funded by the National Institute of Biological Resources (NIBR201503202) and Soonchunhyang University Research Fund.

Mol Genet Genomics

1 3

Compliance with ethical standards

All applicable international, national, and/or institutional guidelines for the care and use of animals were followed.

Conflict of interest The authors declare that they have no conflict of interest.

References

Altermann E, Klaenhammer TR (2005) Pathway voyager: pathway mapping using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database. BMC Genom 6:60. doi:10.1186/1471.2164-6-60

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarkis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consor-tium. Nat Genet 25:25–29

Barat A, Kumar R, Goel C, Singh AK, Sahoo PK (2016) De novo assembly and characterization of tissue-specific transcriptome in the endangered golden mahseer, Tor putitora. Meta Gene 7:28–33

Castellanos-Martinez S, Arteta D, Catarino S, Gestal C (2014) De novo transcriptome sequencing of the Octopus vulgaris hemo-cytes using Illumina RNASeq technology: response to the infec-tion by the gastrointestinal parasite Aggregata octopiana. PLoS One 9:e107873

Chen DN, Zhang GQ (2004) Fauna Sinica (Mollusca: Gastropoda: Stylommatophora: Bradybaenidae). Science Press, Beijing, pp 140–142

Chen X, Yang Q, Li H, Li H, Hong Y, Pan L, Chen N, Zhu F, Chi X, Zhu W, Chen M, Liu H, Yang Z, Zhang E, Wang T, Zhong N, Wang M, Liu H, Wen S, Li X, Zhou G, Li S, Wu H, Varshney R, Liang X, Yu S (2015) Transcriptome-wide sequencing provides insights into geocarpy in peanut (Arachis hypogea L.). Plant Bio-technol. doi:10.1111/pbi.12487

Choi SL, Lee Y-S, Rim Y-S, Kim T-H, Moroz LL, Kandel ER, Bhak J, Kaang B-K (2010) Differential Evolutionary rates of neuronal transcriptome in Aplysia kurodai and Aplysia californica as a tool for gene mining. J Neurogenet 24:75–82

Clark MS, Thorne MA (2015) Transcriptome of the Antarctic brood-ing gastropod mollusc Margarella antartica. Mar Genomics 24:231–232

Collins LJ, Biggs PJ, Voelckel C, Joly S (2008) An approach to tran-scriptome analysis of non-model organisms using short-read sequences. Genome Inform 21:3–14

Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M (2005) Blast2GO: a universal tool for visualization and analysis in functional genomics research. Bioinformatics 21:3674–3676

Coppe A, Pujolar JM, Maes GE, Larsen PF, Hansen MM, Bernatchez L, Zane L, Bortoluzzi S (2010) Sequencing, de novo annotation and analysis of the first Anguilla anguilla transcriptome: Eeel-Base opens new perspectives for the study of critically endan-gered European eel. BMC Genom 11:635

Deng Y, Lei Q, Tian Q, Xie S, Du X, Li J, Wang L, Xiong Y (2014) De novo assembly, gene annotation, and simple sequence repeat marker development using Illumina paired-end transcriptome sequences in the pearl oyster Pinctada maxima. Biosci Biotech-nol Biochem 78:1685–1692

Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequenc-ing, assembly and annotation. Evol Appl 7:1026–1042

Feldmeyer B, Wheat CW, Krezdorn N, Rotter B, Pfenninger M (2011) Short read Illumina data for the de novo assembly of a

non-model snail species transcriptome (Radix balthica, Basom-matophora, Pulmonata), and a comparison of assembler perfor-mance. BMC Genom 12:317

Fiedler TJ, Hudder A, McKay SJ, Shivkumar S, Capo TR, Schmale MC, Walsh PJ (2010) The transcriptome of early life history stages of the California Sea Hare Aplysia californica. Comp Bio-chem Physiol Part D Genomics Proteomics 5:165–170

Firmino AAP, Fonseca FC, de Macedo LLP, Coelho RR, de Souza Jr JDA, Togawa RC, Silva-Junior OB, Pappas-Jr GJ, Mattar da Silva MC, Engler G, Grossi-de-Sa MF (2013) Transcriptome analysis in cotton ball weevil (Anthonomus grandis) and RNA interference in insect pests. PLoS One 8:e85079

Franchini P, van der Merwe M, Roodt-Wilding R (2011) Transcrip-tome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis. BMC Res Notes 4:59

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nus-baum C, Lindblad-Toh K, Friedman N, Regev A (2011) Full-length transcriptome assembly from RNAseq data without a ref-erence genome. Nat Biotechnol 29:644–652

Hardie DG (2007) AMP-activated/SNF-1 kinase: conserved guardians of cellular energy. Nat Rev Mol Cell Biol 8:774–785

Heyland A, Vue Z, Voolstra CR, Medina M, Moroz LL (2011) Devel-opmental transcriptome of Aplysia californica. J Exp Zool B Mol Dev Evol 316B:113–134

Hirano T, Kameda Y, Chiba S (2014) Phylogeny of the land snails bradybaena and phaeohelix (Pulmonata: Bradybaenidae) in Japan. J Mollusc Studies, pp 1–7

Huang C-W, Lee Y-C, Lin S-M, Wu W-L (2014) Taxonomic revision of Aegista subchinensis (Mollendorf, 1884) (Stylommatophora, Bradybaenidae) and a description of new species of Aegista from eastern Taiwan based on multilocus phylogeny and comparative morphology. Zookeys 445:31–55

Hwang H-J, Patnaik BB, Kang SW, Park SY, Wang TH, Park EB, Chung JM, Song DK, Patnaik HH, Kim C, Kim S, Lee JB, Jeong HC, Park HS, Han YS, Lee YS (2016) RNA sequencing, de novo assembly and functional annotation of a Nymphalid butterfly, Fabriciana nerippe Felder, 1862. Entomol Res. doi:10.1111/1748-5967.12164

Joubert C, Piquemal D, Marie B, Manchon L, Pieratt F, Zanella-Cleon I, Cochennec-Laureau N, Gueguen Y, Montagnani C (2010) Transcriptome and Proteome analysis of Pinctada margaritifera calcifying mantle and shell: focus on biomineralization. BMC Genom 11:613

Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38:355–360

Kang SW, Park SY, Patnaik BB, Hwang HJ, Kim C, Kim S, Lee JS, Han YS, Lee YS (2015) Construction of PANM Database (Proto-stome DB) for rapid annotation of NGS data in mollusks. Korean J Malacol 31:243–247

Kashi Y, King D, Soller M (1997) Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 13:74–78

Kuznik-Kowalska E, Lewandowska M, Pokroyszko BM, Prockow M (2013) Reproduction, growth and circadian activity of the snail Bradybaena fruticum (O.F. Muller, 1774) (Gastropoda: Pulmonata: Bradybaenidae) in the laboratory. Cent Eur J Biol 8:693–700

Kwon OK, Park KM, Lee JS (1993) Colored shells of Korea. Aca-demic Publishing Co., Seoul

Lee JS, Kwon OK (1993) Chromosomal studies of eight species of Bradybaenidae in Korea. Korean J Malacol 9:30–43

Leonard JL, Edstrom JP (2004) Parallel processing in an identified neural circuit: the Aplysia californica gill-withdrawal response model system. Biol Rev Camb Philos Soc 79:1–59

Mol Genet Genomics

1 3

Lopez-Uribe CK, Santiago SM, Bogdanowicz BND (2012) Discov-ery and characterization of microsatellites for the solitary bee Colletes inaequalis using Sanger and 454 pyrosequencing. Apid-ologie 44:163–172

Lv J, Liu P, Gao B, Wang Y, Wang Z, Chen P, Li J (2014) Transcrip-tome analysis of Portunus trituberculatus: de novo assembly, growth-related gene identification and marker discovery. PLoS One 9:e94055

Ma J-Q, Yao M-Z, Ma C-L, Wang X-C, Jin J-Q, Wang X-M, Chen L (2014) Construction of a SSR based genetic map and identifica-tion of QTLs for catechins content in Tea plants (Camellia sinen-sis). PLoS One 9:e93131

Majorek KA, Dunin- Horkawicz S, Steczkiewicz K, Muszewska A, Nowotny M, Ginalski K, Bujnicki JM (2014) The RNase H superfamily: new members, comparative structural analysis and evolutionary classification. Nucleic Acids Res 42:4160–4179

Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-seq: an assessment of technical reproducibility and comparison of gene expression arrays. Genome Res 18:1509–1517

Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682

McDowell IC, Nikapitiya C, Aguiar D, Lane CE, Istrail S, Gomez-Chiarri M (2014) Transcriptome of American oysters, Cras-sostrea virginica, in response to bacterial challenge: insights into potential mechanisms of disease resistance. PLoS One 9:e105097

Meng J, Zhu Q, Zhang L, Li C, Li L, She Z, Huang B, Zhang G (2013) Genome and Transcriptome analyses provide insight into the euryhaline adaptation mechanism of Crassostrea gigas. PLoS One 8:e58563

Mikheyev AS, Vo T, Wee B, Singer MC, Parmesan C (2010) Rapid microsatellite isolation from a Butterfly by De novo transcrip-tome sequencing: performance and a comparison with AFLP-derived distances. PLoS One 5:e11212

Min DK, Lee JS, Koh DB, Je JK (2004) Molluscs in Korea. Hangeul Publishing Company, Min Molluscan Research Institute, Korea, p 566

Moroz LL, Edwards JR, Puthanveettil SV, Kohn AB, Ha T, Heyland A, Knudsen B, Sahni A, Yu F, Liu L, Jezzini S, Lovell P, Ian-nucculli W, Chen M, Nguyen T, Sheng H, Shaw R, Kalachikov S, Panchin YV, Farmerie W, Russo JJ, Ju J, Kandel ER (2006) Neuronal transcriptome of Aplysia: neuronal compartments and circuitary. Cell 127:1453–1467

Mosavi LK, Cammett TJ, Desrosiers DC, Peng Z (2004) The ankyrin repeat as molecular architecture for protein recognition. Protein Sci 13:1435–1448

Noseworthy RG, Lim N-R, Choi K-S (2007) A catalogue of the mollusks of the Jeju Island, South Korea. Korean J Malacol 23:65–104

Pallavicini A, Canapa A, Barucca M, Alfoldi J, Biscotti MA, Buono-core F, De Moro G, Di Palma F, Fausto AM, Forconi M, Ger-dol M, Makapedua DM, Turner-Meier J, Olmo E, Scapigliati G (2013) Analysis of the transcriptome of the Indonesian coela-canth Latimeria menadoensis. BMC Genom 14:538

Park G-B (2011) Karyotypes of Korean endemic land snail, Koreano-hadra kurodana (Gastropoda: Bradybaenidae). Korean J Malacol 27:87–90

Patnaik BB, Hwang H-J, Kang SW, Park SY, Wang TH, Park EB, Chung JM, Song DK, Kim C, Kim S, Lee JB, Jeong HC, Park HS, Han YS, Lee YS (2015) Transcriptome characterization for non-model endangered lycaenids, Protantigius superans and Spindasis takanosis, using Illumina HiSeq 2500 sequencing. Int J Mol Sci 16:29948–29970

Patnaik BB, Park SY, Kang SW, Hwang H-J, Wang TH, Park EB, Chung JM, Song DK, Kim C, Kim S, Lee JB, Jeong HC, Park HS, Han YS, Lee YS (2016a) Transcriptome profile of the Asian

Giant Hornet (Vespa mandarinia) using Illumina HiSeq 4000 sequencing: De novo assembly, functional annotation, and dis-covery of SSR markers. Int J Genomics, Article ID 4169587. doi:10.1155/2016/4169587

Patnaik BB, Wang TH, Kang SW, Hwang H-J, Park SY, Park EB, Chung JM, Song DK, Kim C, Kim S, Lee JS, Han YS, Park HS, Lee YS (2016b) Sequencing, de novo assembly, and annotation of the transcriptome of the endangered fresh water pearl bivalve Cristaria plicata provides novel insights into functional genes and marker discovery. PLoS One 11:e0148622

Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, Tsai J, Quackenbush J (2003) TIGR Gene Indices Clustering Tool (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 19:651–652

Prentis PJ, Pavasovic A (2014) The Anadara trapezia transcriptome: a resource for molluscan physiological genomics. Mar Genomics Pt B 18:113–115

Rao R, Zhu YB, Alinejad T, Tiruvayipati S, Thong KL, Wang J, Bhassu S (2015) RNA-seq analysis of Macrobrachium rosenber-gii hepatopancreas in response to Vibrio parahaemolyticus infec-tion. Gut Pathogens 7:6

Ravasi T, Huber T, Zavolan M, Forrest A, Gaasterland T, Grimmond S, RIKEN GER Group, GSL Members, Hume DA (2003) Sys-tematic characterization of the zinc-finger-containing proteins in the mouse transcriptome. Genome Res 13:1430–1442

Ren X, Liu T, Dong J, Sun L, Yang J, Zhu Y, Jin Q (2012) Evaluating de Bruijin graph assemblers on 454 transcriptomic data. PLoS One 7:e51188

Rhee SY, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet 9:509–515

Rogers MF, Ben-Hur A (2009) The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics 25:1173–1177

Sadamoto H, Takahashi H, Okada T, Kenmoku H, Toyota M, Asakawa Y (2012) De novo sequencing and transcriptome analysis of the central nervous system of mollusc Lymnaea stagnalis by deep RNA sequencing. PLoS One 7:e42546

Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA (2012) GAGE: a critical evaluation of genome assemblies assembly algorithms. Genome Res 22:557–567

Schileyko AA (2004) Treatise on recent terrestrial pulmonate mol-luscs. Pt. 12. Bradybaenidae, Monodeniidae, Xanthonychidae, Epiphragmophoridae, Helminthoglyptidae, Elonidae, Humbold-tianidae, Sphincterochilidae, Cochlicellidae. Ruthenica Suppl 2:1627–1763

Shi X, Gupta S, Lindquist IE, Cameron CT, Mudge J, Rashotte AM (2013) Transcriptome analysis of cytokinin response in tomato leaves. PLoS One 8:e55090

Simakov O, Larsson TA, Arendt D (2013) Linking macro- and micro-evolution at the cell type level: a view from the lophotrochozoan Platynereis dumerilii. Brief Funct Genomics 12:430–439

Storey KB, Storey JM (2010) Metabolic regulation and gene expres-sion during aestivation. In: Navas CA, Carvalho JE (eds) Aes-tivation: molecular and physiological aspects, progress in molecular and subcellular biology, vol 49. Springer, Berlin. doi:10.1007/978-3-642-02421-4_2

Sysoev A, Schileyko A (2009) Land snails and slugs of Russia and adjacent countries. Pensoft, Sofia

Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV (2001) The COG database: new developments in phyloge-netic classification of proteins from complete genomes. Nucleic Acids Res 29:22–28

Mol Genet Genomics

1 3

Teaniniuraitemoana V, Huvet A, Levy P, Klopp C, Lhuillier E, Gaert-ner-Mazouni N, Gueguen Y, Le Moullac G (2014) Gonad tran-scriptome analysis of pearl oyster Pinctada margaritifera: iden-tification of potential sex differentiation and sex determining genes. BMC Genom 15:491

Voronin DA, Kiseleva EV (2008) Functional role of proteins contain-ing ankyrin repeats. Cell Tissue Biol 2:1–12

Wang Y, Guo X (2007) Development and characterization of EST-SSR markers in the eastern oyster Crassostrea virginica. Mar Biotechnol 9:500–511

Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, Chen X, Li Y (2010) De novo assembly and characterization of root transcrip-tome using Illumina paired-end sequencing and development of cSSR markers in sweet potato (Ipomoea batatus). BMC Genom 11:726

Wang S, Wang X, He Q, Liu X, Xu W, Li L, Gao J, Wang F (2012) Transcriptome analysis of the roots at early and late seedling stages using Illumina paired-end sequencing and development of EST-SSR markers in radish. Plant Cell Rep 31:1437–1447

Wang L, Yang C, Song L (2013) The molluscan HSP70s and their expression in hemocytes. ISJ 10:77–83

Wang W, Hui JH, Chan TF, Chu KH (2014) De novo transcriptome sequencing of the snail Echinolittorina malaccana: identification of genes responsive to thermal stress and development of genetic markers for population studies. Mar Biotechnol 16:547–559

Werner GDA, Gemmell P, Grosser S, Hamer R, Shimeld SM (2012) Analysis of a deep transcriptome from the mantle tissue of Patella vulgata Linnaeus (Mollusca: Gastropoda: Patellidae) reveals biomineralising genes. Mar Biotechnol. doi:10.1007/s10126-012-9481-0

Wu S-P, Hwang C-C, Huang H-M, Chang H-W, Lin Y-S, Lee P-F (2007) Land molluscan fauna of the Dongsha Island with twenty two recorded species. Taiwania 52:145–151

You FM, Huo N, Gu YQ, Luo M-C, Ma Y, Hane D, Lazo GR, Dvorak J, Anderson OD (2008) BatchPrimer3: a high throughput web application for PCR and sequencing primer design. BMC Bio-inform 9:253

Yue H, Li C, Du H, Zhang S, Wei Q (2015) Sequencing and De Novo assembly of the Gonadal transcriptome of the endangered Chi-nese Sturgeon (Acipenser sinensis). PLoS One 10:e0127332

Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, McCown B, Harbut R, Simon P (2012) Using next-generation sequencing approaches to isolate simple sequence repeats (SSR) loci in the plant sciences. Am J Bot 99:1–16

Zhang J, Liang S, Duan J, Wang J, Chen S, Cheng Z, Zhang Q, Liang X, Li Y (2012) De novo assembly and characterization of the transcriptome during seed development, and generation of genic-SSR markers in Peanut (Arachis hypogea L.). BMC Genom 13:90

Zhang L, Li L, Zhu Y, Zhang G, Guo X (2014) Transcriptome analy-sis reveals a rich gene set related to innate immunity in the East-ern Oyster (Crassostrea virginica). Mar Biotechnol 16:17–33

Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X (2014) Compari-son of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One 9:e78644

Zhao QP, Xiong T, Xu XJ, Jiang MS, Dong HF (2015) De novo tran-scriptome analysis of Oncomelania hupensis after Molluscicide treatment by Next-Generation Sequencing: implications for Biology and future snail interventions. PLoS One 10:e0118673

Zheng X, Pan C, Diao Y, You Y, Yang C, Hu Z (2013) Development of microsatellite markers by transcriptome sequencing in two spe-cies of Amorphophallus (Araceae). BMC Genom 14:490