High-throughput Biological Data The data deluge and bioinformatics algorithms
Bioinformatics databases and applications · rshamir/algmb/02/Rubin.BioinfoApplication.pdf · Eitan...
Eitan Rubin, Bioinformatics & Biological Computing Unit, Department of Biological Services
Bioinformatics databases and applications
Eitan Rubin, December 2002
Outline
• Introduction
• A day in the life of a biologist
• Major databases
• Major tools
Introduction
Life as a simple CS problem
[Diagram: Input1 and Input2 → Algorithm → Output]
A more realistic view
[Diagram: Input1, Input2, Algorithm1, Algorithm2, Algorithm3, a decision step, and Output]
A typical real-life view
[Diagram: the same arrangement (Input1, Input2, Algorithm1, Algorithm2, Algorithm3, decision, Output) repeated five times]
The life cycle of a bioinformatics project
• Clearly define the goals
• Define a strategy
• Run the process
• QA & optimize
– Controls
– External knowledge
– Re-sampling
– Correlation
A day in the life of a biologist
Positional cloning of disease X
XM-417-L15, XM-417-L16
Genome browser @ UCSC
Looking at the region of interest
Gene prediction programs suggest there are 6-8 genes in the region
chrX:98100000-98500000
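The region string above can be handled programmatically before it is pasted into a browser or passed to another tool. The sketch below parses a `chrom:start-end` string and reports its span. The names `Region` and `parse_region` are hypothetical, and the length calculation assumes both coordinates are 1-based and inclusive.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: start and end are both inclusive (1-based coordinates).
        return self.end - self.start + 1


_REGION_RE = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse a string such as 'chrX:98100000-98500000' into a Region."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom, start, end = match.groups()
    # Tolerate thousands separators such as 98,100,000.
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError("start must not exceed end")
    return Region(chrom, start_i, end_i)


region = parse_region("chrX:98100000-98500000")
print(region.chrom, region.start, region.end, region.length)
```

Under the inclusive-coordinate assumption the region spans roughly 400 kb, which gives a sense of scale for the 6-8 predicted genes.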
Get mRNA @ NCBI
>unkown_protein
MRLTEKSEGEQQLKPNNSNAPNEDQEEEIQQSEQHTPARQRTQRADTQPSRCRLPSRRTPTTSSDRTINLLEVLPWPTEWIFNPYRLPALFELYPEFLLVFKEAFHDISHCLKAQMEKIGLPIILHLFALSTLYFYKFFLPTILSLSFFILLVLLLFIIVFILIFF
BLAST @ NCBI
Search for domains @ InterPro
Get predicted protein @ UCSC
>naharu.b
MSSRKQGSQPRGQQSAEEENFKKPTRSNMQRSKMRGASSGKKTAGPQQKNLEPALPGRWGGRSAENPPSGSVRKTRKNKQKTPGNGDGGSTSEAPQPPRKKRARADPTVESEEAFKNRMEVKVKIPEELKPWLVEDWDLVTRQKQLFQLPAKKNVDAILEEYANCKKSQGNVDNKEYAVNEVVAGIKEYFNVMLGTQLLYKFERPQYAEILLAHPDAPMSQVYGAPHLLRLFVRIGAMLAYTPLDEKSLALLLGYLHDFLKYLAKNSASLFTASDYKVASAEYHRKAL
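The sequences on these slides are in FASTA format: a `>` header line followed by the residues. A minimal reader can be sketched as below. The function name `parse_fasta` and the two toy records are hypothetical stand-ins for illustration, not the sequences shown on the slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record name (first word after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            name = parts[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            # Sequence lines may be wrapped, so append to the current record.
            records[name] += line
    return records


# Toy input, for illustration only.
example = """>toy_protein_1 hypothetical record
MKTAYIAK
QRQISFVK
>toy_protein_2
MSSRK
"""
records = parse_fasta(example.splitlines())
for record_name, sequence in records.items():
    print(record_name, len(sequence))
```

Keeping the header and the sequence on separate lines matters: most downstream tools treat everything after the first `>` line as residues.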
Major databases
• INSD (GenBank, EMBL, DDBJ)
• Specialized databases: FlyBase, YPD, UCSC, TAIR
• EPD
• ???
• StackDB; GenCarta; Ensembl
• HSSP
• PDB
• BIND; MINT; BRITE …
• Swiss-Prot; InterPro; LAMA; GO
INSD
• GenBank, EMBL, DDBJ
• CleanBank
• Divisions (EST, HTG)
• Specialized databases
Major tools
• Transcript modelling from ESTs
– Sequencher, Staden, StackPACK
• Database searching
– BLAST
– BLAT
– FASTA
• Multiple Sequence Alignment
– ClustalX
– MACAW
Major tools
• Gene prediction
• (EST) assembly
• Promoter finding
• ORF identification
• Similarity searching
• MSA
• Phylogenetic analysis
• Structure prediction
• Docking
ClustalX
• Stepwise tree-guided alignment
• “Bag full of tricks”
• Demo
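"Stepwise tree-guided alignment" means the sequences are not aligned all at once: pairwise distances are computed first, a guide tree is built from them, and the alignment is then assembled by following the tree from the closest pairs outward. The sketch below illustrates only the ordering step. It is not ClustalX's own procedure: a crude k-mer distance and a simple average-linkage clustering stand in for whatever distance measure and tree-building method the program uses, and the names `kmer_distance` and `guide_order` and the toy sequences are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """1 - Jaccard similarity of the k-mer sets (a crude stand-in for alignment distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def guide_order(seqs, k=2):
    """Return the merge steps of an average-linkage clustering, closest clusters first."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    clusters = [(name,) for name in names]
    steps = []

    def linkage(c1, c2):
        # Average distance over all cross-cluster sequence pairs.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        steps.append((c1, c2))
    return steps


# Toy sequences, for illustration only: a/b are near-identical, as are c/d.
toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "MSSRKQGSQP",
    "d": "MSSRKQGSAP",
}
for left, right in guide_order(toy):
    print("align", left, "with", right)
```

Each printed step corresponds to one alignment in a progressive scheme: first pairs of sequences, then groups of already-aligned sequences. Errors made in early steps are carried forward, which is one reason the order matters.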
The effect of parameters
[Figure: results with modified parameters compared with default parameters]
Similarity searching
• SW (accelerated)
• BLAST
+ The NCBI environment, fast, wide dynamic range, availability
- DNA very bad stats, poor for proteins? Highly local FASTA
• BLAT
+ Lightning fast, focused
- Limited dynamic range
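Assuming "SW" in the list above stands for Smith-Waterman local alignment, the core of the method is a small dynamic-programming recurrence in which scores are never allowed to drop below zero. The sketch below computes only the best local score. The function name `smith_waterman_score` and the match, mismatch, and gap values are illustrative assumptions; real tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always zero for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Clamping at zero lets an alignment restart anywhere.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy sequences, for illustration only.
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The full matrix costs time proportional to the product of the two sequence lengths, which is why database-scale searches rely on accelerated implementations or on heuristics such as BLAST and BLAT.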
MSA
• ClustalX
+ Fast; familiar
- Global; one, not very accurate algorithm
• MACAW
+ Very interactive; outstanding GUI; multiple algorithms
- Immature; runs on PCs; incompatible
• BLOCKS maker
+ Fully automated; fast
- Poor control; many mistakes