Bioinformatics
Bioinformatics
Overview
School of B&I TCD May 2010
Who, me?
• Andrew Lloyd
• 087-225-9850, 053-9255717, 01-896-2450
• Director INCBI 1993-2000
• Population genetics, evolution
• Whole genome analysis
• Immunology, chickens, FIRM
Definition/scope
• Storage, retrieval and analysis of biological (sequence) information.
• Insert better definition here
• Case can be made for microarray analysis
• NOT
– ecoinformatics (ecology)
– Image analysis
– Bar-coding hospital sheets
Philosophy
“Nothing worth learning can be taught” Oscar Wilde
Getting bioinformation
• Type it in: A,T,C,C,G,T,C,A (1991)
• Access databases
– Literature (PubMed)
– Medical (OMIM)
– DNA sequence (EMBL/GenBank)
– Protein sequence (UniProt, SwissProt, PIR)
– 3-D structure (PDB)
Annotation
• In any DB, half is data and half context.
– Gene ontology (language)
– Parsing sequence (ORF, RBS, Intron, α-helix)
– Recognising similar sequences (evolution!)
– Complementary info: DB cross-referencing
• (DNA -> Protein -> 3D structure -> motifs)
Secondary databases
• Protein motifs, domains, families
• RNA structures (16S ribosomal RNA…)
• Taxonomy/classification
• Metabolic pathways (KEGG)
• Enzymes (Brenda, TCD, Ireland)
• SNPs: mutations and variants
• Disease DBs (OMIM)
• Immuno, epitope DBs
Complete genomes
• Ensembl (complex, basically vertebrate)
– Uniform look-and-feel; cross-refs
• UCSC GoldenPath browser
• Plants
• Bacterial genomes
– Including mitochondrial, chloroplast
– Eubacteria vs Archaea vs Eukaryotes
Annotated/known genes
• What does my gene do?
• Blast (fasta) against the DB
• SRS/Entrez to access databases
– Neighboring (similar things in same DB)
• DB cross-references
– Full picture of attributes
– What biochemical pathway?
The territory
• UniProt: protein sequence
• GenBank/EMBL: DNA sequence
• PDB: 3-D structure
• OMIM
• PubMed
• Taxonomy
• Maps & Genomes
• Full-text journals
• Prosite, Pfam, PSSM
Databases
• BIG
• EMBL/GenBank: 200 Gbp, 100m entries, 2500 complete genomes, 200K species
• Encycl. Britannica: 180m letters, 40m words
• EMBL = 1 km of Britannica volumes
• Doubling every 14-18 months
• Human genome is X bp?
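The "doubling every 14-18 months" figure can be turned into a rough projection. The sketch below assumes simple exponential growth from the 200 Gbp quoted on the slide; the function name and the five-year horizon are illustrative choices, not part of the course material.

```python
def projected_size(current_gbp, months_ahead, doubling_months):
    """Projected size if the database doubles every `doubling_months` months."""
    return current_gbp * 2 ** (months_ahead / doubling_months)


# Assumed starting point: the 200 Gbp quoted for EMBL/GenBank on the slide.
for doubling in (14, 18):
    size = projected_size(200, 60, doubling)
    print(f"doubling every {doubling} months: ~{size:,.0f} Gbp after 5 years")
```

The two doubling times bracket the projection, which shows how sensitive a five-year estimate is to that one parameter.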
Intrinsic vs Context
Internal
• DNA, protein sequence
– DNA: Purine/Pyrimidine
– AAs: small, hydrophobic, aromatic, polar
– Variants: SNPs, Indels, Alt Splicing
• 2ndry structure
– DNA: stem/loops
– Protein: helix, sheet, turn, loop
Intrinsic vs Context
External, context for your molecule
• In other species (homologs, phylog trees)
• In which cell
• In which cellular location (GO)
• Molecular complex (dimers)
• Which pathway (KEGG)
• Where in genome (neighbors, synteny)
New Unknown Gene
• Blast homology searching
• Genomic location/neighboring genes
• Where is it expressed?
• How regulated (control sequences)
• Intron/exon structure
• Domain structure
• Restriction sites etc.
• Primer design
DNA/gene structure
• Four bases: A, T, C, G (U replaces T in RNA)
– 2 pyrimidine, 2 purine
– LOTS of them: how many?
• Open reading frame
• 5’ signals, 3’ signals
• Introns/exons
• Neighbours (operons)
Two sequences
• Alignment
– Local
– Global
• Dotplot
• Threading
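The difference between global and local alignment is easiest to see in the scoring recurrences. The sketch below computes scores only (no traceback), using Needleman-Wunsch for the global case and Smith-Waterman for the local case. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: gaps only
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"))           # one gap, three matches
print(local_score("TTTACGTTT", "GGACGGG"))   # shared core "ACG"
```

The only differences are the initial row and column and the floor at zero, which is what allows a local alignment to ignore poorly matching flanks.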
One seq vs many
• Homology search vs database
• Special case of 2-seq alignment
• Blast vs fasta
• Limit by species/taxon
• Substitution matrices
• Low complexity masking
Multiple sequence alignment
• MSA
• Progressive alignment
• ClustalW or (better) T-Coffee
Phylogenetic trees
• Computationally intensive
• Distance matrix methods
– Neighbor-joining (NJ)
– UPGMA
• Minimum evolution
• Maximum parsimony
• Maximum likelihood
– Bayesian methods
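As an illustration of the distance matrix approach, here is a minimal UPGMA sketch. It assumes the sequences are already aligned to equal length and uses the simple proportion of differing sites (p-distance) as the distance. The sequences and species labels are invented toy data, and branch lengths are omitted.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster aligned sequences by UPGMA; returns the tree as nested tuples."""
    # Each cluster maps its tree (a name or a tuple) to its member names.
    clusters = {name: [name] for name in seqs}
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        # UPGMA: mean of all pairwise distances between members.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


# Toy data, invented for illustration only.
toy = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "TCGAACTA"}
print(upgma(toy))
```

UPGMA assumes a constant rate of change along all lineages; neighbor-joining relaxes that assumption, which is one reason it is often preferred.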
Genefinding
• Special case of DNA analysis
• How to annotate a genome
• Bacterial
– Find open reading frames (ORFs)
– With start/stop codons
– With promoter, RBS, CAAT, TATA
• Eukaryotic
– As above PLUS
– Introns/exons
– Alternative splicing
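The bacterial case (find ORFs bounded by start and stop codons) can be sketched in a few lines. This is a deliberately naive scan of the three forward reading frames. It ignores the reverse strand, promoters, ribosome binding sites and introns, and the `min_codons` threshold is an arbitrary illustrative parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `start` is 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts coding codons (ATG included, stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


seq = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(seq):
    print(frame, seq[start:end])
```

Real gene finders add statistical models of codon usage and signal sequences, and eukaryotic genes need explicit handling of intron/exon boundaries.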
Typical mammalian gene structure
[Figure: Control region, Start (ATG), Exon 1, Exon 2, Exon 3, Exon 4, Stop; introns lie between the exons]
• Introns (5’ gt… …ag 3’) “spliced out” and discarded
• DNA -> RNA (with introns) -> RNA (spliced) -> PROTEIN
• DNA: ATGCCCAGGAGATTTGGA . . .
• PROTEIN: MetProArgArgPheGly . . .
• Stop: TAG, TGA, TAA
• miRNAs?
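The example sequence above (ATGCCCAGGAGATTTGGA, giving Met-Pro-Arg-Arg-Phe-Gly) can be checked with a short translation routine. The sketch below builds the standard genetic code from the conventional TCAG ordering and uses one-letter amino acid codes. It assumes the input is already a spliced coding sequence in frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGCCCAGGAGATTTGGA"))  # MPRRFG
```

MPRRFG is the one-letter form of MetProArgArgPheGly, matching the slide.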
Protein substructure
• DNA makes protein and protein (enzymes) make everything else.
• 20 Amino acids
• Amino acid properties
• Motifs
• Domains
• Biological units
Amino acid properties
again … and again and again
Protein 3-D structure
• Relationship between sequence & structure
• Secondary structure
– Alpha helix
– Beta sheet
– Coil
– Turn
• Threading sequence to homologous structure
Gene Expression
• EST
• SAGE
• MicroArray
• Clustering of same expressed genes
Genomics
• Complete DNA seq for a species
• Gene order
• Gene clusters/operons
– Missing operons
• Gene duplication
• Whole genome duplication (WGD)
SNPs
• Key issue in genetics is that two organisms are both the same and different:
– Humans vs chimps vs mouse
– Parent vs offspring vs co-national vs human
• Single nucleotide polymorphisms
• Variation between individuals
• Pharmacogenetics
– Personal tailored medicine
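At its simplest, spotting single nucleotide differences means comparing two aligned sequences column by column. The sketch below assumes the sequences are already aligned, with "-" marking gaps, and skips gap columns because those are indels rather than SNPs. The function name and the example sequences are illustrative.

```python
def snp_positions(ref, other):
    """List (position, ref_base, other_base) for single-base differences.

    Positions are 1-based columns of the alignment. Gap columns are
    skipped because they represent indels, not SNPs.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, o)
            for i, (r, o) in enumerate(zip(ref, other))
            if r != o and r != "-" and o != "-"]


print(snp_positions("ACGTACGT", "ACGTACGA"))
```

A difference between two individuals only counts as a polymorphism if it recurs in the population, so real SNP calling also needs frequency data across many samples.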
Summary/take home
• Course designed to give you access to databases, software tools
• …and ways of thinking about data