The 11th Japan-Korea-China Bioinformatic Training Course & Symposium
Soochow University, Suzhou, China
June 17 – 18, 2013
<CKJ Bioinformatics Training Course 2013>
11th China-Korea-Japan Bioinformatics Training Course
Big Data needs Big Idea: What can we solve from genetic information?
Soochow University, Suzhou, China
17 June, 2013
Takashi Gojobori
Center for Information Biology
National Institute of Genetics (NIG), Mishima, Japan
Ion PGM
Ion Proton
Item | Description
Read Length and Speed | 512 nanopores x 15 bp/sec => ~7,500 bp/sec
Read Accuracy | 99.8%
6 Hours Life Time | 150 x 10^6 bp
Applied Current / Blockage | 60 picoamps to anywhere from 20-40 picoamps
No. of nanopores | 2,000 nanopores / cartridge. Will become available in early 2013 containing over 8,000 nanopores. → Delivers a complete human genome in 15 minutes.
Sample Preparation | Any user-derived sample preparation resulting in double stranded DNA (dsDNA) in solution is compatible with the system.
Amplification | No sample amplification.
Cost | $900
Commercialization | Oxford Nanopore intends to commercialise GridION and MinION directly to customers within 2012.
Nano Pore Oxford (2012)
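The throughput figures in the nanopore table can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted on the slide (512 nanopores, 15 bp/sec per pore, 6-hour lifetime); the variable names are illustrative, and reading the lifetime row as "bases per 6 hours" is an assumption.

```python
# Figures quoted on the slide; variable names are illustrative only.
PORES = 512
BASES_PER_PORE_PER_SEC = 15
LIFETIME_HOURS = 6

# Aggregate read speed across all pores (the slide rounds this to ~7,500 bp/sec).
throughput = PORES * BASES_PER_PORE_PER_SEC

# Bases read over the quoted 6-hour lifetime at that speed.
lifetime_bases = throughput * LIFETIME_HOURS * 3600

print(f"{throughput} bp/sec, {lifetime_bases:,} bp in {LIFETIME_HOURS} hours")
```

The product is 7,680 bp/sec, and six hours at that rate gives roughly 1.66 x 10^8 bp, the same order as the 150 x 10^6 bp listed in the table.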
From the Genome Revolution to Sequencing Revolution
Sequencing =>
/ Genome sequencing
/ Meta-genomics
/ Gene Expression (Transcription - EST, SAGE, and CAGE)
/ miRNA, functional non-coding RNAs, siRNAs (Translational regulation)
/ ChIP-Seq (ChIP-Chip, ChIP-PET)
/ PPI (Two hybrid System)
/ Epi-genomics (Methylated sites)
- Problem -
• Sydney Brenner says, “Low input, High throughput, and No output”!
[Diagram: from bio-samples to data analysis]
Bio-samples: Cells, Tissues, Organs, Species, Populations + Time / Environments / Conditions
NGS (next-generation sequencers): Genome, Meta-genome, Epi-genome, RNA-seq, ChIP-seq, PPI, Synthetic
Sequence data → Database → Data Analysis / Informatics
Hospitals and Clinical Stations (applied fields)
Basic Science and Biomedical Sciences (Life Science Dept and Medical School at University)
Issues on retrieving the necessary information
The lack of a standard format and of unified information often seriously hinders research and development.
[Diagram: a user surrounded by many separate databases (DB) covering Medicine, Lab Experiment, Protein Structure, Journals, and Sequence data]
How are the pieces of information related to each other?
Where is the information needed?
Too many databases….
(The Gist)
Traditional Model: “Submission only” Database
From sporadic information to a unified and easily retrievable format!
New Model: “Unified and Easily Retrievable” Database
Which one to use?
This is the right one to use!
Databases with various formats
D. Howe, M. Costanzo, P. Fey, T. Gojobori, L. Hannick, W. Hide, D. Hill, R. Kania, M. Schaeffer, S. St Pierre, S. Twigger, and S. Rhee. Nature (2008) 455: 47-50
We conducted Japanese Governmental projects.
I. FANTOM1~5 Project (1999~), sponsored by RIKEN/DDBJ with MEXT (Ministry of Education, Science, Sports, and Culture).
II. H-Invitational Human Transcripts Project (2000~), sponsored by METI (Ministry of Economy, Trade, and Industry) and JBiC.
III. Human Genome Network Project (2004~2009), sponsored by MEXT.
IV. Cell Innovation Project (2009~2014), sponsored by MEXT.
V. Structural Life Science Project (2012~2017), sponsored by MEXT.
For mouse full-length cDNAs,
Nature (2001) 409: 685-690
Nature (2002) 420: 563-573
Science (2005) 309: 1559-1563
Nature Genetics (2009) 41: 553-62
Co-organized by JBIRC and DDBJ
Attended by more than 118 people from 40 organizations such as JBIRC, DDBJ, NCBI, EBI, Swiss-Prot/SIB, Sanger Institute, NCI-MGC, DOE, NIH, DKFZ, CNHGC (Shanghai), RIKEN, Tokyo U, MIPS, CNRS, MCW, TIGR, CBRC, Murdoch U, U Iowa, Karolinska Inst., WashU, U Cincinnati, Tokyo MD U, KRIBB, South African Bioinfor Inst, U College London, Reverse Proteomics Res. Inst., Kazusa DNA Inst, Weizmann Inst, Royal Inst. Tech. Sweden, Penn State U, Osaka U, Keio U, Kyushu U, TIT, Ludwig Inst. Brazil, Kyoto U, German Can. Inst., and NIG
Supported by JBIC, METI, MEXT, CNRS, NIH, and DOE
H-Inv
NIH
DKFZ
Tokyo U・NEDO
Kazusa
Helix Inst.
Shanghai
41,118 FLcDNA clones
21,037 loci (possible genes)
5,155 new loci vs RefSeq
Nature (2002) 419: 3-4
PLoS Biol (2004) 2: 1-21
Science (2004) 304: 368
Nature (2002) 419: 3-4 News
September 5, 2002
This will be a real human gene catalogue – not predicted from the human genome sequence…
H-InvDB Annotation Summary (rel 8.0)
Functional Annotation Category | Number*
protein coding | 36,096
I: Identical to known human protein (experimentally validated) | 15,141
II: Similar to known protein | 6,532
III: InterPro domain-containing protein | 1,026
IV: Conserved hypothetical protein | 1,744
V: Hypothetical protein | 5,813
VI: Hypothetical short protein (20-80 aa) | 5,840
VII: Pseudogene candidate (transcribed) | 752
non-protein-coding | 8,329
*representative transcripts
Categories I, II and III (known function) define a reliable set of human protein-coding genes (22,699 genes).
Current number: ~36,000.
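The two totals quoted in the annotation summary follow directly from the per-category counts. A small check, using only the numbers from the table (the dictionary and variable names are illustrative):

```python
# Representative-transcript counts per category, copied from the H-InvDB rel 8.0 summary.
protein_coding = {
    "I": 15141,    # identical to known human protein
    "II": 6532,    # similar to known protein
    "III": 1026,   # InterPro domain-containing protein
    "IV": 1744,    # conserved hypothetical protein
    "V": 5813,     # hypothetical protein
    "VI": 5840,    # hypothetical short protein
}

# "Reliable set": categories with known function (I, II and III).
reliable = sum(protein_coding[c] for c in ("I", "II", "III"))

# Categories I-VI add up to the "protein coding" row.
total = sum(protein_coding.values())

print(reliable, total)
```

Categories I to III sum to 22,699 and categories I to VI sum to 36,096, so the pseudogene candidates (VII) are counted outside the protein-coding total.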
Distribution of tissue_type origin (H-InvDB_8.0 representative transcripts)
*Total 694 tissue_types for 27,819 transcripts.
Statistics of human gene polymorphisms
Region | SNP | Indel | Total
5’UTR | 41,369 | 1,993 | 43,369
ORF | 95,496* | 1,474 | 96,970
3’UTR | 85,423 | 4,926 | 90,349
Total | 207,374 | 8,393 | 215,767
SNPs in ORF: Synonymous 40,484; Nonsynonymous 53,754; Nonsense (stop codon) 1,258**; Extension 75**; Termination codon, synonymous 123
Indels in ORF: Non-frameshifting 180; Cause frameshift 1,289; 5’UTR-inORF 2; inORF-3’UTR 3
• dbSNP build 125, human genome assembly build 35 and H-InvDB release 3 (representative transcripts of protein-coding genes) were used.
• * includes 311 unclassified SNPs
• ** allele that matches the cDNA sequence was considered to be ancestral
Yamaguchi-Kabata, et al. (2008)
H-InvDB Advanced Search Tool
Search Conditions / Advanced Search Main Window
Three sets of data in H-InvDB
1. representative transcripts: 46,499
2. representative alternative splicing variants: 61,473
3. all transcripts: 296,912
H-InvDB Enrichment Analysis Tool (HEAT)
Execution Button
HEAT is a data-mining tool for finding common features that are enriched in a given human gene set.
Annotation items analyzed:
- Chromosomal band
- Gene family/group
- Gene Ontology (GO)
- Functional domains (InterPro)
- Structural domains (SCOP)
- Metabolic pathways (KEGG)
- Subcellular localization (Wolf PSORT)
- Tissue-specific gene expression (H-ANGEL)
- Sequence motifs in promoter regions (JASPAR)
Knowledge discovery tool from H-InvDB
URL: http://hinv.jp/HEAT/
(Insert Gene List Here)
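To illustrate what "enriched in a given gene set" can mean in practice, here is a minimal sketch of a one-sided hypergeometric test, a common choice for this kind of analysis. The slides do not state which statistic HEAT actually uses, so this is an assumption; the function and parameter names are hypothetical and do not come from HEAT.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a random
    sample of `sample` genes drawn without replacement from `population`
    genes, of which `annotated` carry the annotation."""
    upper = min(annotated, sample)
    # Count the ways to draw exactly k annotated genes, for every k >= hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


# Hypothetical example: 20 of 200 submitted genes carry an annotation that
# 1,026 of 36,096 background genes carry.
p_value = hypergeom_enrichment_p(population=36096, annotated=1026, sample=200, hits=20)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value suggests the annotation appears in the gene list more often than chance alone would explain. When many annotation items are tested at once, as in the list above, a multiple-testing correction would normally be applied on top.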
Genome Network Project
Human CAGE Tags: 46,205,347 tags
PARK7
HTT
PARK2
PSEN1
APOE
PARK1 ALS1
ALS2
ALS4
PSEN2
APP
An integrated encyclopedia of DNA elements in the human genome.
ENCODE Project Consortium, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. Collaborators (594)
Nature (2012) 489(7414): 57-74.
Abstract
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
Susumu Ohno (1928 – 2000)
Junk DNA
• Term coined in the article “So much ‘junk DNA’ in our genome” (Brookhaven Symposium on Biology, 1972)
Cell Innovation Project (2009~2014)
Cell Innovation Project (2009~2014)
Data Analysis Center ‐ Gojobori’s Lab
Data Analysis Center
Sequencing Center
Unify management between two centers
・LSA Cooperation/Data Transfer
・Joint research & development
Public
Co‐Research Institute
Strengthen cooperation with Co‐research Institute
【e.g.】
・Developing Miyano lab. “CelliP”
Provide Information
・starting up “Wiki”
NGS First Timer
NGS Experienced one
Above all things, Result!
・Analysis (DAP)
・Progress (DTM‐IF)
・Reading (GE)
Want to analyze freely!
・WKT
Want some more advanced analysis!
・Individual support
【e.g.】
・Miyano Lab. “CelliP”
Technical Researcher
Leading Research Project
Basic Research
・NGS Public data research
・Assemble research etc.
Activities
Research/Investigation
Establish analysis flow
・Bisulfite analysis etc.
Research individual life phenomena
・Various collaborative research etc.
Integrate NGS data & knowledge, and spread them out globally
Develop Know‐how
【e.g.】
・Ito lab. “Bisulfite”
Development
NGS Data Analysis Software
Legend: ◯: Done, △: Underway; C: CONG, P: Pipeline; ■: 2011 Additions

Class  Tools  Installed  Released  Remarks
Mapping:
  BWA ◯ ◯ CP
  Bowtie ◯ ◯ CP
  SOAP2 ◯ ◯ CP
  FAMSR (Original) ◯ ◯ CP
  MLA (Original) ◯ ◯ CP
  Maq ◯
  SeqMap ◯ ◯ P
  BLAST+ ◯ ◯ P
  mpiBLAST ◯
  BLAT ◯
  LAST ◯
  BSMAP ◯ ◯ P
  BFAST ◯
  TopHat ◯ ◯ P
  TopHat‐Fusion ◯
  MUMMER ◯

Class  Tools  Installed  Released  Remarks
Assembly:
  Velvet ◯ ◯ C
  ABySS ◯
  SOAPdenovo ◯ ◯ C
  ALLPATH ◯
  Hassp (Original) ◯ ◯ C
  SSAHA2 ◯
  PCAP ◯
  IMAGE ◯
  CLUSTALW ◯
RNA‐seq:
  rSeq ◯ ◯ CP
  ERANGE ◯
  Cufflinks ◯ ◯ P
RNA‐seq de novo:
  Trans‐ABySS ◯
  Oases ◯
  Trinity ◯
small RNA:
  mireap ◯

Class  Tools  Released  Remarks
  ViennaRNA
ChIP‐seq:
  ISOLATE
  PeakSeq
  MACS
  QuEST
  SISSRs ◯ C
  FindPeaks
  GPS
BS‐seq:
  Bismark ◯ P
  rrbsmap
Exome:
  SOAPsnp
  GATK
  ANNOVAR ◯ P
Utility:
  FASTX‐Toolkit ◯ CP
  SAMtools ◯ CP
  Picard
  EMBOSS

Class  Tools  Installed  Released  Remarks
  dnaa ◯
  BEDtools ◯ ◯ CP
  FastQC ◯ ◯ P
Bundled with NGS instruments:
  Corona Lite ◯
  CASAVA ◯
  BioScope ◯ △ P
  Helisphere ◯
  Newbler ◯
Tools developed by partner institutions:
  Epigenome analysis (Miyano Lab) △
  Cellip (Miyano Lab) △
  Image annotation analysis (Toyoda Lab) △
  Omics information integration analysis (Toyoda Lab) △
Analysis Pipeline System
• We developed a pipeline system that automatically runs routine analyses
– This system provides rapid analyses by parallel processing across multiple servers.
– It prevents unnecessary file copies and uses storage features appropriately to meet different needs.
Cooperation | Analysis Pipeline | Utilizing Data Analysis Result | Cloud Analysis | Order‐made Analysis | Sharing Information
Classification | Pipeline name | Description
Mapping | BWA | Mapping program: fast and sufficiently reliable mapping with few indels and mismatches
Mapping | SOAP2 | Mapping program: very fast when the number of queries is large. Start‐up cost is high.
Mapping | Bowtie | Mapping program: very fast mapping without indel detection. Start‐up cost is low.
Mapping | FAMSR | Pipeline: browse mapping results that take splice sites into account in a genome view
Mapping | MLA | Pipeline: search for insertion/deletion mutations and browse them in a genome view
RNA‐seq | rSeq | Pipeline: calculate the expression of known gene models by the RPKM method
ChIP‐seq | SISSRs: ChIP‐Seq analysis | Pipeline: detect binding sites and browse them in a genome view
ChIP‐seq | SISSRs: filtering | Pipeline: make a list of specific binding sites and link it directly to the browser.*
Bisulfite | BSMAP: Mapping | Pipeline: identify methylated bases and browse them in a genome view*
Bisulfite | BSMAP: Window analysis | Pipeline: report highly methylated genomic regions.*
List of available pipelines
*Under development (to be released in Oct.)
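The RPKM method named in the rSeq row normalizes a gene's read count by both gene length and sequencing depth: RPKM = 10^9 x (reads mapped to the gene) / (gene length in bp x total mapped reads). The sketch below only illustrates that formula; the function name and arguments are hypothetical and are not taken from rSeq or from the pipeline itself.

```python
def rpkm(reads_in_gene, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return reads_in_gene * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 100 reads on a 1,000 bp gene model in a run with 1,000,000 mapped reads.
print(rpkm(100, 1000, 1_000_000))
```

Because both length and depth are divided out, values can be compared between genes of different sizes and between runs of different depth.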
Construction of ChIP‐seq Data Analysis Pipeline
For ChIP‐seq
• Sharing ChIP‐seq analysis work flow (Built into analysis pipeline)
Sequence (FASTQ) → Bowtie → Map Results (SAM) → SISSRs → BS Results (TSV) → Window Analysis → Report (HTML)
DB Loader → Data Base (PG SQL)
Base view of binding region
Binding region report
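The ChIP‐seq work flow above can be described as a chain of stages, each consuming the file format that the previous stage produces. The sketch below models only that chaining, using the stage names and formats shown on the slide; the `Stage` class and `check_chain` helper are hypothetical, and no real command lines for Bowtie or SISSRs are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str      # tool or step name as shown on the slide
    consumes: str  # input file format
    produces: str  # output file format


# Main path of the ChIP-seq flow, taken from the slide.
CHIP_SEQ_FLOW = [
    Stage("Bowtie", "FASTQ", "SAM"),
    Stage("SISSRs", "SAM", "TSV"),
    Stage("Window Analysis", "TSV", "HTML"),
]


def check_chain(stages):
    """Return True when every stage consumes what the previous stage produces."""
    return all(prev.produces == nxt.consumes for prev, nxt in zip(stages, stages[1:]))


print(check_chain(CHIP_SEQ_FLOW))
```

Declaring the formats explicitly lets a shared pipeline reject a mis-ordered work flow before any mapping job is started.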
Binding region judged by SISSRs
Mapped + strand reads
Mapped – strand reads
Hyper Link
For Bisulfite
Construction of Bisulfite Data Analysis
• Sharing Bisulfite analysis work flow (Built into analysis pipeline)
Sequence (FASTQ) → BSMAP (Genome) → Map Results (BSMAP) → Repeat Filter → Map Results (BSMAP) → Calc. Meth Rate → Results (TSV) → Window Analysis → Report (HTML)
DB Loader → Data Base (PG SQL)
Ability to display methylated bases
Utilizing Window and center data
Hyper Link
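The "Calc. Meth Rate" and "Window Analysis" steps can be sketched as follows. In bisulfite sequencing, unmethylated cytosines are read as T while methylated cytosines stay C, so a per-site rate is commonly computed as C / (C + T). The slides do not give the pipeline's exact formula, window size, or threshold, so the fixed non-overlapping windows, the defaults, and all names below are assumptions for illustration.

```python
def methylation_rate(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine, or None if uncovered."""
    covered = c_reads + t_reads
    return c_reads / covered if covered else None


def highly_methylated_windows(site_rates, window=100, min_rate=0.8):
    """Group per-site rates into fixed-size windows and keep windows whose
    mean rate is at least `min_rate`. `site_rates` maps position -> rate."""
    buckets = {}
    for pos, rate in site_rates.items():
        buckets.setdefault(pos // window, []).append(rate)
    result = []
    for idx in sorted(buckets):
        mean = sum(buckets[idx]) / len(buckets[idx])
        if mean >= min_rate:
            # Report the window as (start, end, mean rate), end exclusive.
            result.append((idx * window, (idx + 1) * window, mean))
    return result


# Hypothetical per-site read counts: position -> (C reads, T reads).
counts = {10: (9, 1), 20: (1, 1), 150: (8, 0)}
rates = {pos: methylation_rate(c, t) for pos, (c, t) in counts.items()}
print(highly_methylated_windows(rates))
```

Only the second window is reported here, because the first window's mean rate falls below the threshold.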
T C T C A C A A G G T A C A
T C T C A C A A G G T A C A
Methylated base
System to view analysis results
• Developed a viewer to check the status of genome mapping
– displays multiple mapping results at high speed and in parallel
DEMO: Click Here
Link to GNP web.
Base display function
Magnify
Base display feature for confirming deletions, insertions, substitutions, and methylated bases
Utilization of KEI Super‐Computer
Biological hierarchy
Ecosystem (Environments)
|
Population (Human population)
|
Individual (Human)
|
Organ (Lung, Stomach)
|
Tissue (Epidermal tissue)
|
Cell (Red blood cell)
|
Organelle (Mitochondria)
|
Bio-molecules (DNA, RNA, Proteins)
|
Molecules (H2O, O2)
Integration
Evolution
Information Explosion in Life Science
Beyond the 4th Paradigm proposed by Jim Gray
[Timeline: ・・・ 2010, 2020, 2030, 2040]
1st Paradigm: Experiments
2nd Paradigm: Theory
3rd Paradigm: Simulation
4th Paradigm: Data‐driven Scientific Discovery
5th Paradigm: Data‐driven Scientific Innovation
Jim Gray
・Info for global environments
・Info for genomes
・Info for outer space
・Info for oceans
・Info for the earth
・・・・・
Information Explosion
Scientific Innovation and its Application to the Society
Genome Information Society
Acknowledgements
/ Genome Network Project Consortium
- RIKEN (Hayashizaki’s group)
- Tokyo U. (Sugano’s group)
- Hitachi Co. Ltd.
- Keio U. (Yanagawa’s group)
- Saitama Medical College (Okazaki’s group)
/ H-Invitational Consortium
/ BIRC at AIST, Tokyo, Japan
/ NIG – Genome Network Platform group
Thank you!