Presentation at ZSJ 2013 by Shigehiro Kuraku
‘the complete set of phylogenetic trees derived from the proteome of an organism’
Sicheritz-Pontén and Andersson, 2001. Nucleic Acids Res. 29: 545
genome-wide events +
gene family-specific events
August 2012. At Daitoku-ji Temple, Kyoto
[Figure: three trees labeled Hypothesis C, Hypothesis A, and Hypothesis B, each showing human, chicken, shark, lamprey, hagfish, amphioxus, and tunicate, with lamprey and hagfish marked as cyclostomes]
- Composition of Hox/Dlx clusters
  Neidert et al., 2001. PNAS; Irvine et al., 2002. J Exp Zool B; Force et al., 2002. J Exp Zool B; etc.
- ParaHox clusters Furlong et al., 2007. MBE
- Mol. phylogeny of 33 gene families Escriva et al., 2002. MBE
- Amphioxus genome Putnam et al., 2008. Nature
- Mol. phylogeny of 55 gene families
Kuraku et al., 2009. MBE
- Sea lamprey genome analysis
Smith, Kuraku et al., 2013. Nature Genetics
- Globin gene phylogeny
Hoffmann et al., 2010. PNAS
Tree inference: heuristic ML with the JTT+G4 model
Support values: ML-BP/NJ-BP (bootstrap probabilities)
(Kuraku & Kuratani, 2011. Genome Biol. Evol.)
(cf. hidden paralogy)
Informatics Modern sequencing
Molecular Developmental Biology
Genome Resource & Analysis Unit Center for Developmental Biology
RIKEN, Kobe, Japan
Illumina HiSeq 1500
Installed in November 2011
~150 bp reads in Rapid Run mode
Sanger sequencing, Cell sorting with FACS, clone distribution, etc.
Kuraku et al., 2013. Nucleic Acids Res.; Amemiya et al., 2013
Not only sequencing
Our experiences at GRAS
・Main applications: RNA-seq & ChIP-seq
・Troubleshooting with tight wet-dry communication
・Diverse non-model organisms for RNA-seq
・Many requests with limited sample amounts
・Look carefully for acceptable pricing and service contents
・Longer Illumina reads are not necessarily beneficial
・Sequencers ‘can’ produce ‘data’ from problematic samples
  Low-quality DNA/RNA, contamination, over-amplification, …
For retrieving complete genome and original transcriptome
~150 bp on HiSeq & ~300 bp on MiSeq (as of September 2013)
Prep of libraries with longer inserts
e.g. How many reads do you need?
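One rough way to approach the "how many reads" question is the usual depth relation: mean depth = (number of reads × read length) / genome size. The sketch below is only a back-of-the-envelope helper, not something taken from the slides; the function names and the example genome size, read length, and target depth are illustrative assumptions.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_depth, paired=True):
    """Return the number of reads (or read pairs, if paired=True)
    required to reach a target mean depth of coverage."""
    bases_needed = genome_size_bp * target_depth
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_unit)


def expected_depth(n_reads, read_length_bp, genome_size_bp):
    """Return the mean depth obtained from n_reads single reads."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Hypothetical example: 1 Gb genome, 150 bp paired-end reads, 50x depth.
    pairs = reads_needed(1_000_000_000, 150, 50, paired=True)
    print(f"read pairs needed: {pairs:,}")
```

This ignores duplicates, low-quality reads, and uneven coverage, so real projects usually budget well above the figure it returns.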
Species | Sequenced at | Gene model by | Sequencing technology | Published in | # of authors | Started in
sea lamprey | Wash. Univ. | Yandell lab / Ensembl | Sanger | Nat. Genet. (2013) | 59 | 2005?
soft-shelled turtle | BGI | BGI / Ensembl | Illumina | Nat. Genet. (2013) | 34 | 2010
coelacanth | Broad Institute | Broad / Ensembl | Illumina | Nature (2013) | 91 | 2011
International consortium
Vertebrate ‘new genes’
GC & codon usage bias
Myelin-associated genes
Smith, Kuraku, et al. 2013. Nature Genetics
Sequenced at Wash. Univ. Genome Institute
In-house annotation effort
Horizontal gene transfer
Kuraku et al., 2012. Genome Biol. Evol.
GC-content & codon usage bias
Qiu et al., 2011. BMC Genomics
Trained gene prediction settings are available at the AUGUSTUS web server
Contributed analysis
http://www.ensembl.org/Petromyzon_marinus/Info/Index
Coding genes: 10,415
Incomplete genome assembly: Pax6 missing
Incomplete gene annotation: Fgf8/17-A missing
(as of September 2013; release 73)
Amino acid composition
Deviation of ‘gene model’ in lamprey genome
Smith, Kuraku, et al. 2013. Nature Genetics
Methods: Correspondence analysis for frequencies of 20 amino acids
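A correspondence analysis of this kind takes a table of amino acid frequencies, one row per species or gene set, as its input. The sketch below only builds that input table from protein sequences; the correspondence analysis itself would then be run in a statistics package. The function names and the toy sequences are assumptions for illustration, not the procedure used in the cited paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(proteins):
    """Return the relative frequency of each of the 20 amino acids
    across a collection of protein sequences (other symbols are ignored)."""
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def frequency_table(proteomes):
    """Map each species name to its 20-element frequency vector."""
    return {name: aa_frequencies(seqs) for name, seqs in proteomes.items()}


if __name__ == "__main__":
    # Toy, made-up peptides purely to show the call pattern.
    table = frequency_table({"species_1": ["MAGGP", "MPRA"], "species_2": ["MKKLF"]})
    for name, freqs in table.items():
        print(name, [round(f, 2) for f in freqs])
```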
Codon usage bias
Heavy use of GC-rich codons
Species compared: sea lamprey, stickleback, Tetraodon, Takifugu, platypus, medaka, dog, human, mouse, ghost shark, zebrafish, chicken, anole lizard, opossum, X. tropicalis
Methods: RSCU (Sharp et al., 1986) and ENc (Wright, 1990)
Qiu et al., 2011. BMC Genomics
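For readers unfamiliar with RSCU: it is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so values above 1 indicate a preferred codon. The following is a minimal sketch of that calculation under the standard genetic code; the function name and the toy input are assumptions, and ENc is not computed here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def rscu(cds_list):
    """Return RSCU values per codon for a list of in-frame coding sequences.

    Stop codons and single-codon amino acids (Met, Trp) are omitted,
    as are amino acids that never occur in the input.
    """
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0 or len(codons) == 1:
            continue
        for c in codons:
            # observed / (total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    # Toy sequence: four alanine codons, three of them GCC.
    print(rscu(["GCCGCCGCCGCT"]))
```

In a GC-rich coding set such as the one described for lamprey, codons ending in G or C would be expected to show RSCU values above 1.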
Genomic DNA
→ sequencing (Sanger, 454, Illumina, and/or PacBio) → Raw reads
→ assembly → Genome assembly (contigs/scaffolds)
→ gene prediction (after ‘training’) → ‘Gene model’ (protein-coding sequences)
Reference for gene prediction: transcriptome, annotated genes in GenBank
Content lost along the way: heterochromatin etc.; repeats and regions with low depth; ‘unusual’ genes
Evaluating the genome assembly (cf. Assemblathon2 - Bradnam et al., 2013)
・‘NG50’ instead of N50
・CEGMA (Parra et al., 2007): coverage of CEGs
・CGAL, REAPR, ALE: evaluation by identifying misassemblies
・QUAST: computation of assembly summary
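The difference between N50 and NG50 is only the denominator: N50 uses the total length of the assembly itself, whereas NG50 uses an estimated genome size, which makes assemblies of different total length comparable. A minimal sketch, with a hypothetical function name and made-up contig lengths:

```python
def nx50(lengths, genome_size=None):
    """Return N50 of the given contig/scaffold lengths, or NG50 if an
    estimated genome_size is supplied.

    Returns None when the assembly covers less than half of genome_size,
    in which case NG50 is undefined.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    half = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]        # made-up lengths
    print("N50 :", nx50(contigs))
    print("NG50:", nx50(contigs, genome_size=400))
```

Because a fragmented assembly that omits much of the genome can still report a flattering N50, NG50 is the safer number when a genome size estimate exists.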
248 core eukaryotic genes (CEGs)
Species | Assembly release | # of CEGs found (including ‘partial’) | Published?
human | GRCh37 (hg19) | 248 | First draft in 2001
mouse | GRCm38 (mm10) | 239 | First draft in 2002
X. tropicalis | JGI_4.2 | 239 | Hellsten et al., 2010
coelacanth | LatCal1 | 236 | Amemiya et al., 2013
spotted gar | LepOcu1 | 235 | unpublished
soft-shelled turtle | PelSin_1.0 | 232 | Wang et al., 2013
anole lizard | AnoCar2.0 | 231 | Alföldi et al., 2011
zebrafish | Zv9 | 230 | Howe et al., 2013
chicken | galGal4 | 220 |
chicken | WASHUC2.63 (galGal3) | 210 | First draft in 2004
Japanese lamprey | LetCam1 | 199 | Mehta et al., 2013
sea lamprey | PerMar1 | 172 | Smith et al., 2013
little skate | version2 | 77 | unpublished
elephant shark | (1.4x) | 58 | Venkatesh et al., 2007
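The counts above are easier to compare as a percentage of the 248 CEGs. A small sketch of that conversion, using a few of the counts listed above; the function and variable names are illustrative only:

```python
TOTAL_CEGS = 248  # size of the core eukaryotic gene set used by CEGMA


def ceg_completeness(found, total=TOTAL_CEGS):
    """Return the percentage of CEGs detected in an assembly."""
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    return 100.0 * found / total


if __name__ == "__main__":
    # Counts copied from the table above (complete plus 'partial' hits).
    cegs_found = {
        "human GRCh37": 248,
        "chicken galGal4": 220,
        "Japanese lamprey LetCam1": 199,
        "sea lamprey PerMar1": 172,
        "little skate version2": 77,
        "elephant shark (1.4x)": 58,
    }
    for name, found in cegs_found.items():
        print(f"{name}: {ceg_completeness(found):.1f}%")
```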
Evaluating the ‘gene model’
・‘Annotation Turnover’ and ‘AED’ (Eilbeck et al., 2009)
・Also, run CEGMA to check transcript diversity?
  Nakamura et al., 2013
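As commonly described, annotation edit distance (AED) compares two annotations of the same locus at the nucleotide level: AED = 1 - (sensitivity + specificity) / 2, where sensitivity is the shared length divided by the reference length and specificity is the shared length divided by the annotation length. The sketch below follows that description for simple exon lists; the function names and the 1-based inclusive coordinate convention are assumptions, and it is not the implementation used by any particular annotation pipeline.

```python
def covered_positions(exons):
    """Return the set of nucleotide positions covered by a list of
    (start, end) exons, using 1-based inclusive coordinates."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def aed(annotation_exons, reference_exons):
    """Return the annotation edit distance between two exon lists.

    0.0 means the two annotations cover identical nucleotides;
    1.0 means they share none.
    """
    annotation = covered_positions(annotation_exons)
    reference = covered_positions(reference_exons)
    if not annotation or not reference:
        raise ValueError("both annotations must cover at least one base")
    shared = len(annotation & reference)
    sensitivity = shared / len(reference)
    specificity = shared / len(annotation)
    return 1 - (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Made-up gene models: one exon shifted by 50 bases against the other.
    print(aed([(1, 100)], [(51, 150)]))
```

Storing every position in a set is fine for single genes; for genome-scale comparisons an interval-based overlap would be more economical.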
- Phylogenetic properties of the species of your interest
  e.g. ploidy level, distance to close relatives, …
  www.genomesize.com, www.timetree.org
- Any clue about its molecular attributes?
  e.g. GC-content, repeats, intron/UTR length, …
  Use existing resources at SRA & Sanger traces at NCBI dbEST
- Genome or transcriptome to sequence?
  Any existing or emerging resources?
- Sample prep mostly determines the fate of the project
  Quantification with Qubit; rRNA removal controlled with BioAnalyzer
- Rigorous QC of prepared libraries before sequencing
  e.g. ChIP-qPCR before ChIP-seq
- RNA-seq: sequence identification or quantification?
  Replication > Depth (Rapaport et al., 2013. Genome Biol.)
- Fostering more productive sequencing facilities in Japan
  GRAS accepts visits from facility managers/staff
- Education of researchers with dual (wet/dry) capabilities
  ‘A sequencer or a bioinformatician?’
  Learning material: ‘Unix & Perl for Biologists’ by Korf Lab
  http://korflab.ucdavis.edu/unix_and_Perl/
- Importing latest information from overseas