Bioinformatics. introduction molecular biology biotechnology bioMEMS bioinformatics ...
bioinformatics
introduction
molecular biology
biotechnology
bioMEMS
bioinformatics
bio-modeling
cells and e-cells
transcription and regulation
cell communication
neural networks
dna computing
fractals and patterns
the birds and the bees ….. and ants
course layout
book
Introduction to Computational Molecular Biology
introduction
DNA
DNA
central dogma
definitions
Informatics: the science of information management
Bioinformatics: the science of biological information management
what is bioinformatics?
Bioinformatics is Multidisciplinary
Computer Science
Math
Statistics
Structural Biology
Phylogenetics
Drug Design
Genomics
Molecular Biology
interdisciplinary
increasing levels of complexity
Genome (DNA)
Transcriptome (RNA)
Proteome (proteins)
Metabolome (metabolic pathways)
[Figure: GenBank basepair growth, 1982-1999, in millions of base pairs. Yearly values: 1, 2, 3, 5, 10, 16, 24, 35, 49, 72, 101, 157, 217, 385, 652, 1,160, 2,009, 3,841. Source: GenBank]
growth of biological databases
3D structures growth: http://www.rcsb.org/pdb/holdings.html
symbol meaning explanation
G G Guanine
A A Adenine
T T Thymine
C C Cytosine
R A or G puRine
Y C or T pYrimidine
N A, C, G or T aNy base
U U Uracil
DNA/RNA
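The symbol table above can be put to work directly. The following is a minimal Python sketch, assuming only the symbols listed in the table; the function name `expand` is hypothetical and chosen for illustration. It lists every concrete sequence that a pattern with ambiguity symbols stands for.

```python
from itertools import product

# Symbols from the table above; each maps to the bases it can stand for.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "N": "ACGT",  # aNy base
}


def expand(pattern: str) -> list[str]:
    """Return every concrete sequence matched by a pattern of symbols."""
    choices = [IUPAC[symbol] for symbol in pattern.upper()]
    return ["".join(bases) for bases in product(*choices)]


if __name__ == "__main__":
    # R and Y each stand for two bases, so this pattern covers four sequences.
    print(expand("RAATTY"))
```

The number of expansions grows multiplicatively with each ambiguous symbol, so this brute-force listing is only practical for short patterns.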
some definitions
use of computers to catalog and organize molecular life science information into meaningful entities
subset of computational biology
methods to analyse, store, search, retrieve and represent biological data by computers / in computers
massive amounts of data: databases
extracting information and knowledge from "raw" data
for most bioscientists, all they need in bioinformatics is sequence analysis
definitions of bioinformatics
bioinformatics is not just the storage of data in a computer.
bioinformatics is the use of computers to test a biological hypothesis prior to performing the experiment in the laboratory.
bioinformatics is the design of software programs that analyse data.
what does it do?
nucleotide and protein sequences
protein structures
all sorts of functional data related to genes, proteins and their regulation, interactions etc.
curated and non-curated databases
bioinformatics databases
sequence searching and sequence alignments
looking at properties that can be analyzed/predicted from sequence data
protein structures and their analysis
structural classification
visualisation of macromolecules
"system-wide" understanding of the biology of a given organism
some goals
genomes and their annotation
complete genomes of many organisms are available
seeing "parts lists" of everything an organism needs and figuring out how they work together
annotation: looking at the DNA sequence
genomes and their annotation
gene finding is not always straightforward
problem: rare gene products, for which you cannot find corresponding mRNA or protein sequences in databanks
additional complication: alternative splicing, many transcripts per gene
genomes and their annotation
if you intend to analyze or just use data from a databank, it is useful to know both the goals and the reality of its annotation level
inconsistencies, missing data
even well-annotated databanks provide only a fraction of all biologically relevant information on a gene or a molecule (compared to literature)
annotation: a vision
databank content: all knowledge on functions of a gene product
add structural information
insights into structure-function relationships
add data on expression patterns and regulation
understanding cell differentiation and other big questions in biology on the molecular level
current -omics
metabolomics
“…to identify, measure and interpret the complex time-related concentration, activity and flux of metabolites in cells, tissues, and other bio-samples such as blood, urine, and saliva.”
systems biology
Integrated view of biology at multiple levels
Generation of quantitative, predictive models of the behavior of biological systems, such as organisms
bioinformatics in short
very short
common genes?
Application of information technology to the storage, management and analysis of biological information
Facilitated by the use of computers
what is bioinformatics?
Sequence analysis: geneticists/molecular biologists analyse genome sequence information to understand disease processes
Molecular modeling: crystallographers/biochemists design drugs using computer-aided tools
Phylogeny/evolution: geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences
Ecology and population studies: bioinformatics is used to handle large amounts of data obtained in population studies
Medical informatics
Personalised medicine
[Figure: flowchart "sequence analysis: overview". Box labels, in extraction order:]
Nucleotide sequence file
Search databases for similar sequences
Sequence comparison
Multiple sequence analysis
Design further experiments
Restriction mapping
PCR planning
Translate into protein
Search for known motifs
RNA structure prediction
non-coding
coding
Protein sequence analysis
Search for protein coding regions
Sequencing project management
Protein sequence file
Sequence comparison
Search for known motifs
Predict secondary structure
Predict tertiary structure
Create a multiple sequence alignment
Edit the alignment
Format the alignment for publication
Molecular phylogeny
Protein family analysis
Nucleotide sequence analysis
Sequence entry
Manual sequence entry
Sequence database browsing
Search databases for similar sequences
gene sequencing
Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences
Sequences producing significant alignments:                          (bits)  Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]         112  7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...      106  5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...       69  7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]          30  0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]          29  1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]          29  1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
The BLAST program was written to allow rapid comparison of a new gene sequence with the hundreds of thousands of gene sequences in databases
database similarity searching
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135
             .         .         .         .         .
814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172
             .         .         .         .         .
864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
sequence comparison
Gene sequences can be aligned to see similarities between genes from different sources
restriction mapping
Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes
Enzyme     Count  Recognition sequence
AceIII     1      CAGCTCnnnnnnn’nnn...
AluI       2      AG’CT
AlwI       1      GGATCnnnn’n_
ApoI       2      r’AATT_y
BanII      1      G_rGCy’C
BfaI       2      C’TA_G
BfiI       1      ACTGGG
BsaXI      1      ACnnnnnCTCC
BsgI       1      GTGCAGnnnnnnnnnnn...
BsiHKAI    1      G_wGCw’C
Bsp1286I   1      G_dGCh’C
BsrI       2      ACTG_Gn’
BsrFI      1      r’CCGG_y
CjeI       2      CCAnnnnnnGTnnnnnn...
CviJI      4      rG’Cy
CviRI      1      TG’CA
DdeI       2      C’TnA_G
DpnI       2      GA’TC
EcoRI      1      G’AATT_C
HinfI      2      G’AnT_C
MaeIII     1      ’GTnAC_
MnlI       1      CCTCnnnnnn_n’
MseI       2      T’TA_A
MspI       1      C’CG_G
NdeI       1      CA’TA_TG
Sau3AI     2      ’GATC_
SstI       1      G_AGCT’C
TfiI       2      G’AwT_C
Tsp45I     1      ’GTsAC_
Tsp509I    3      ’AATT_
TspRI      1      CAGTGnn’
[Figure: restriction map with position axis 50, 100, 150, 200, 250]
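Locating recognition sequences is, at its simplest, a substring search. The sketch below assumes exact, unambiguous sites only (four enzymes from the list above, with the cut marks removed); the function name `find_sites` and the test sequence are hypothetical. Real restriction-mapping programs also handle ambiguity symbols and cut offsets, which this sketch does not.

```python
# Recognition sequences copied from the list above, cut marks removed.
SITES = {"EcoRI": "GAATTC", "AluI": "AGCT", "DpnI": "GATC", "MspI": "CCGG"}


def find_sites(sequence: str, sites: dict[str, str] = SITES) -> dict[str, list[int]]:
    """Return the 1-based start positions of each recognition sequence."""
    sequence = sequence.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start + 1)  # report 1-based coordinates
            start = sequence.find(site, start + 1)  # allow overlapping hits
        hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    print(find_sites("AAGAATTCAGCTGGATCC"))
```

These four sites are palindromic, so searching one strand is sufficient for them; a non-palindromic site would also need a search on the reverse complement.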
PCR primer design
Oligonucleotides for use in the polymerase chain reaction can be designed using computer-based programs
OPTIMAL primer length --> 20
MINIMUM primer length --> 18
MAXIMUM primer length --> 22
OPTIMAL primer melting temperature --> 60.000
MINIMUM acceptable melting temp --> 57.000
MAXIMUM acceptable melting temp --> 63.000
MINIMUM acceptable primer GC% --> 20.000
MAXIMUM acceptable primer GC% --> 80.000
Salt concentration (mM) --> 50.000
DNA concentration (nM) --> 50.000
MAX no. unknown bases (Ns) allowed --> 0
MAX acceptable self-complementarity --> 12
MAXIMUM 3' end self-complementarity --> 8
GC clamp how many 3' bases --> 0
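The length, GC% and melting-temperature limits above can be checked with a few lines of code. This is only a sketch: the melting temperature here uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), as a stand-in, and not the salt- and concentration-dependent calculation that the design program evidently uses. The function names are hypothetical, and the self-complementarity limits are not checked.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature (deg C) by the Wallace rule (an assumption)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def acceptable(primer: str, min_len=18, max_len=22, min_tm=57.0, max_tm=63.0,
               min_gc=20.0, max_gc=80.0, max_n=0) -> bool:
    """Check a primer against the limits from the parameter list above."""
    primer = primer.upper()
    return (min_len <= len(primer) <= max_len
            and primer.count("N") <= max_n
            and min_gc <= gc_percent(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)


if __name__ == "__main__":
    candidate = "ATGCATGCATGCATGCATGC"  # made-up 20-mer for illustration
    print(gc_percent(candidate), wallace_tm(candidate), acceptable(candidate))
```

The Wallace rule is only a coarse estimate for short oligonucleotides, so a primer that passes this check should still be verified with a proper design tool.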
Alignment formatted using MacBoxshade
Sequences of proteins from different organisms can be aligned to see similarities and differences
multiple sequence alignment
E.coli
C.botulinum
C.cadavers
C.butyricum
B.subtilis
B.cereus
Phylogenetic tree constructed using the Phylip package
Analysis of sequences allows evolutionary relationships to be determined
phylogeny inference
large scale bioinformatics: genome projects
Mapping: identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
Sequencing: assembling clone sequence reads into large (eventually complete) genome sequences
Gene discovery: identifying coding regions in genomic DNA by database searching and other
Function assignment: using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
Data mining: searching for relationships and correlations in the information
Genome comparison: comparing different complete genomes to infer evolutionary history and genome rearrangements
genomics
introduction to DNA microarrays
massive data sets from simultaneous expression levels of thousands of genes
impossible to grasp directly by the human mind
methods are needed for finding meaningful results and patterns from the bulk of data
DNA microarray bioinformatics
data manipulation: normalization etc.
data clustering: genes which behave in a similar fashion
sample classification by profiles of predictive genes (e.g. cancer typing)
data mining: finding interpretation to clustering results
example: recognition of regulatory factor binding sites in coexpressed genes
hierarchy of relationships:
genome
gene 1 gene 2 gene 3 gene X
protein 1 protein 2 protein 3 protein X
function 1 function 2 function 3 function X
basis of molecular biology
Organism          Genome size (bp)     Genes
FERN              160,000,000,000
LUNGFISH          139,000,000,000
SALAMANDER         81,300,000,000
NEWT               20,600,000,000
ONION              18,000,000,000
GORILLA             3,523,200,000
MOUSE               3,454,200,000
HUMAN               3,400,000,000     31,000
Drosophila            137,000,000     13,500
C. elegans             96,000,000     19,000
Yeast                  12,000,000      6,315
E. coli                 5,000,000      5,361
smallest genome            ??????
comparative genomics: whole-genome analyses, evolution studies, analyses of components in a "complete" system
functional genomics = inferring functions from data: expression patterns, gene regulation; sequence comparisons, homologue relationships; studies of gene variation, altered phenotypes
genomics
DNA array technology
Array Type              Spot Density (per cm2)   Probe    Target   Labeling
Nylon Macroarrays       < 100                    cDNA     RNA      Radioactive
Nylon Microarrays       < 5000                   cDNA     mRNA     Radioactive/Fluorescent
Glass Microarrays       < 10,000                 cDNA     mRNA     Fluorescent
Oligonucleotide Chips   < 250,000                oligos   mRNA     Fluorescent
spotting robot
microarray expression analysis
microarray
photolithography
array terminology
[Figure: 70-mer vs. 40-mer probes; attachment via NH2 groups; target]
microarray
microarray data
control mouse
a stressed mouse
RNA extraction
target labeling
image analysis
gene expression analysis
determination of expression levels
DNA micro-array
genomics
The application of high-throughput automated technologies to molecular biology.
The experimental study of complete genomes.
genomics technologies
Automated DNA sequencing
Automated annotation of sequences
DNA microarrays: gene expression (measure RNA levels), single nucleotide polymorphisms (SNPs)
Protein chips (SELDI, etc.)
Protein-protein interactions
cDNA spotted microarrays
Affymetrix gene chips
microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments
microarray data analysis
impact on bioinformatics
Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.
proteomics
what is proteomics?
"The analysis of the entire protein complement expressed by a genome, or by a cell or tissue type."
Two most related technologies:
2-D electrophoresis: separation of complex protein mixtures
Mass spectrometry: identification and structure analysis
Wasinger VC et al. Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis 16 (1995) 1090-1094
transcription
genomic DNA
Structure Regulation Information
Computers cannot determine which of these 3 roles DNA plays based solely on sequence… (although we would all like to believe they can)
introduction to proteomics
Definitions
Classical: restricted to large-scale analysis of gene products involving only proteins
Inclusive: combination of protein studies with analyses that have genetic components such as mRNA, genomics, and yeast two-hybrid
Don't forget that the proteome is dynamic, changing to reflect the environment that the cell is in
1 gene is no longer equal to one protein
1 gene = how many proteins?
1 gene = 1 protein?
why proteomics?
Annotation of genomes, i.e. functional annotation
Genome + proteome = annotation
Protein function
Protein post-translational modification
Protein localization and compartmentalization
Protein-protein interactions
Protein expression studies
Differential gene expression is not the answer
types of proteomics
Protein expression: quantitative study of protein expression between samples that differ by some variable
Structural proteomics: the goal is to map out the 3-D structure of proteins and protein complexes
Functional proteomics: to study protein-protein interactions, 3-D structures, cellular localization and PTMs in order to understand the physiological function of the whole proteome
introduction to proteomics
composition of the proteome depends on cell type, developmental phase and conditions
proteome analyses are still struggling to solve the "basic proteome" of different cells and tissues, or limited changes under changing conditions or during processes
current methods can only "see" the most abundant proteins
expression proteomics = differential proteomics = 2D-PE + MS
interaction proteomics
functional proteomics = systematic perturbation or functional inactivation of proteins in a given environment
structural proteomics
proteomics
typically a combination of 2D protein electrophoresis and mass spectrometry
labour-intensive, not really ”high-throughput” methods
more efficient ”protein array” methods are emerging
proteomics experiments
bioinformatics in proteomics
High-throughput determination of the 3D structure of proteins
Goal: to be able to determine or predict the structure of every protein
Direct determination: X-ray crystallography and nuclear magnetic resonance (NMR)
Prediction: comparative modeling, threading/fold recognition, ab initio
structural proteomics
To study proteins in their active conformation
Study protein:drug interactions
Protein engineering
Proteins that show little or no similarity at the primary sequence level can have strikingly similar structures.
why structural proteomics?
FtsZ - protein required for cell division in prokaryotes, mitochondria, and chloroplasts.
Tubulin - structural component of microtubules - important for intracellular trafficking and cell division.
FtsZ and Tubulin have limited sequence similarity and would not be identified as homologous proteins by sequence analysis.
an example
Burns, R., Nature 391:121-123
Picture from E. Nogales
FtsZ and tubulin have little similarity at the amino acid sequence level
homologues
Yes! Proteins that have conserved secondary structure can be derived from a common ancestor even if the primary sequence has diverged to the point that no similarity is detected.
are FtsZ and tubulin homologues?
structure is function
protein structure
Imaging Experimental X-ray diffraction data
Predicting structure in silico from sequence
Make crystals of your protein, 0.3-1.0 mm in size
Proteins must be in an ordered, repeating pattern
An X-ray beam is aimed at the crystal and data are collected
The structure is determined from the diffraction data
X-ray crystallography
http://www-structure.llnl.gov/Xray/101index.html
X-ray crystallography
Schmid, M. Trends in Microbiology, 10:s27-s31.
crystals
X-ray crystallography
Protein must crystallize: need large amounts (good expression); must be soluble (many proteins aren't, e.g. membrane proteins)
Need to have access to an X-ray beam
Solving the structure is computationally intensive
Time: it can take several months to years to solve a structure
Efforts to shorten this time are underway to make this technique high-throughput
general process for proteomics research
Image Analysis
Digester
Spot picker
Gel hotel
Spotter
MS
2-D Gel
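Adapted from the web page of Prof. Rong-Huay Juang (莊榮輝), Department of Microbiology and Biochemistry, National Taiwan University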
general process for proteomics research
protein microarray
arrayIT TM
protein microarray
arrayIT TM
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
what can protein microarrays do?
1. Protein / protein interaction
2. Enzyme / substrate interaction (transient)
3. Protein / small molecule interaction
4. Protein / lipid interaction
5. Protein / glycan interaction
6. Protein / Ab interaction
1. G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
2. H.Zhu et al, 2001 Science 293:2101
3. Ziauddin J and Sabatini DM, 2001 Nature 411:107
protein microarrays (Antibody arrays)
the real world
The true spot quality compared to a real experiment
mobility of protein in an electric field
Mobility: electrolytic molecules move in an electric field
$$\text{Mobility} \propto \frac{[\text{Electric field (mV)}] \times [\text{Net charge of molecule}]}{[\text{Friction between molecules and matrix}]}$$
2-dim electrophoresis
Digest to peptide fragments
MS analysis
2-D gel electrophoresis
First dimension: denaturing isoelectric focusing; separation according to the pI (the isoelectric point of the protein)
Second dimension: SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis); separation according to the molecular weight
result example
Ion source: substance to ion gas
Mass analysis: according to mass/charge (m/z)
Detection: femtomole to attomole
Ion source, ion separator, detector
mass spectrometry
[Figure: a pulsed UV or IR laser (3-4 ns) hits a peptide mixture embedded in light-absorbing chemicals (matrix), producing a cloud of protonated peptide molecules; a strong electric field (accV) sends the ions through the Time Of Flight tube, under vacuum, to the detector]
principle of mass spec
[Figure: Linear Time Of Flight tube (ion source, detector) and Reflector Time Of Flight tube (ion source, reflector, detector), each with a time of flight axis]
principle of mass spec
typical result
Nuclear Magnetic Resonance Spectroscopy (NMR)
Can be performed in solution
No need for crystallization
Can only analyze proteins that are < 300 aa; many proteins are much larger
Can't analyze multi-subunit complexes
Proteins must be stable
structure modeling
Comparative modeling: modeling the structure of a protein that has a high degree of sequence identity with a protein of known structure; must be > 30% identity to have a reliable structure
Threading/fold recognition: uses known fold structures to predict folds in primary sequence
Ab initio: predicting structure from primary sequence data; usually not as robust, computationally intensive
sequence alignment
sequence alignment
©Ken Howard, Scientific American, July 2000
sequenceGATCAAACATTAAACATCCTGAGATCCAAAGGTAAGAGATCTAGCCACAGGGAGTGCTGGGGATTCGGGTCCTGGTGATCTTCACATGCTGACATAGCTCAGCCCTTTTTGGCCCTGGCTTTGTCCTGTTGTGGGCTTTCCCATCTGCAACCCATGCTCCTGGGCCATTTTCCTATGGGCCAGGGAAAACAAGATGGGGTGAAGGCACCCTTACATTTAGGGGCAAGACCTAGTACTCAGAAGGATTCAGAAACTGAAATAGCTGGGTGATACCACACAGGTGCTAGGGATAAGGGGCCTTGAGCCATGGACCATGGGAACTACAAAGCTGAAGGAGCTGCTGCCTCAGCAGAACCAGCGCTTGAATTTGTTCTTTCAGAACCTCAGTCTCTTCCTCTGAAAAATGGGTGTGTTGTGTATCCCACATTCCCAAGTCAGCCATGGGACCAAATGTGAGCGTGTGGGTTTTGCCTCCTGAGAAACTCAGGGGAGCAGAATGCTACAGTGGGTGAATTGGATTCTTTCAGAGAGCCCACCCTGTTTCCCACATCAGCCAGAAGGCTCAAAACCCTGAAGAGCTTTCTGAACTTTGAGGTGCCCAAAGCTTCAGGGCTGTATGGGAAGCACCTGAGGTCCAAGTCCGTTTACAAGAATTTTGTTTTTTGGTTTACAGCTGCTTGGCCGGTCCAAGGAGCAGGTTTGGGTCCTGTGCTCCACAGACCTAAGGGTTACCTTAGAGCTTATGGGAGAGCATTGTGTGTGGACAGTGGACAGTGCCCTCTAGTGCTCAGTGTTAGCACTACATCCAGTTGCCCTCCACCAGTTTATGCTGCTGAGGAAGTCTTTCTTTTCCCAACAGCAGTGTCTCTCCCTCTCCCACCCCCTCTCCCTCTCCCTCCCCCCCTAGGTTATTTTTATTTTTACTGGTGTGTATGTGTGTGAGTCTATGTCACATGTATGAGAGTGCTTGTGGAGACCAGAAGAGGGCATCAGAAGAGCCCCTAGAACTGGAGTATAGGTGGTTGTGAGCCACTTGTCATGGGTGTTGGGAACCAAACTCAGGTTCTCTGGAAGAACAACAAGCTCCCTTATCATATAAGCCATCTCTAAATCCAGGACATTTTTTTTTTTTTTTTTGAGATTTAGAGATTCAAGGAGGAGGAACAATAGGAGGAAGAAGGGGACAGAATAAGGCCAACAAAATGACCAAGGAGGTATAGGCACTTGAAGCCAAACCTAAGTACCTGAGTTCAATCCCTGGGACCCACATGATGGAAAGATGGAATCGATCCCCAAAAGTTATCTTCTGATCCCTATATGCACACACTTGAGGATGGACAGACAAAGAGACAGACACACAAACACACACAAATGTAACTGAAAAAGAAACCTCTATGGGGACATCGCCTTCTTGGAGAGGCTCTGTTGCCCCTCATCCTAGTGAACAAACAACTCCTACTCCCTGCCAGAGTATCCTACCCTTGGATTCAAAATGGTCTCAGAGGACACACCGGGTGGGCTCTGTCGCTGGGATCTTGCATAACCAATGCCCATAAGCCTGGCAAAGGTGGCGATGAGACGATAAGGTCAGGGACATGACCGCAGAAGAGGAGTGGGGACGCGATGAGTGGGAGGAGCTTCTAAATTATCCATCAGCACAAGCTGTCAGTGGCCCCAGCCATGAATAAATGTATAGGGGGAAAGGCAGGAGCCTTGGGGTCGAGGAAAACAGGTAGGGTATAAAAAGGGCACGCAAGGGACCAAGTCCAGCATCCTAGAGTCCAGATTCCAAACTGCTCAGAGTCCTGTGGACAGATCACTGCTTGGCAATGGCTACAGGTAAGCATGCGCAAATCCCGCTGGGTGTGGTTTGGGACCCAGGGCCCCTGAAGATGGATCTGAGGCTTCTAATGTGAGTGCGTTCCAACTTCTGCCATGTTGGGAATACTCTGGGTCCCTATGGGGATTGGGAGAGATCGGCCATTGCTCCCA
GGTTTCTCCTGCCCTCCTGTCTCTCTCTAGACTCTCGGACCTCCTGGCTCCTGACCGTCAGCCTGCTCTGCCTGCTCTGGCCTCAGGAGGCTAGTGCTTTTCCCGCCATGCCCTTGTCCAGTCTGTTTTCTAATGCTGTGCTCCGAGCCCAGCACCTGCACCAGCTGGCTGCTGACACCTACAAAGAGTTCGTAAGTTCCCCAGAGATGGGTGCCCGTTTGTGGAAGCAGGAAGGGGCAGGTCCTACCCCATACTCCTGGCCCCAGGGAAGGTCAATGGAGGGGAAATTATGGGGTAGGGGAATCTTAGCCAATGCTGTACCATAGTAATGATGGTGACGAGACACAAGCTGGTCCCTCAGTGACCACCCTTCTTCCAGGAGCGTGCCTACATTCCCGAGGGACAGCGCTATTCCATTCAGAATGCCCAGGCTGCTTTCTGCTTCTCAGAGACCATCCCGGCCCCCACAGGCAAGGAGGAGGCCCAGCAGAGAACCGTGAGTAGTCCCAGGCCTTGTCTGCACAAATCCTCGTTTCCCTCCATGCAGCCCTAACTGCACTCCAGGCCAGGGACCAGCTCCTCCCTGAAGCTGGGGTAACCTGGGAGTCCCAGGCAGAGGTCACTAGGCAATACACTAACCCCAGCCCTTTTTTTCCCCCCTCAGGACATGGAATTGCTTCGCTTCTCGCTGCTGCTCATCCAGTCATGGCTGGGGCCCGTGCAGTTCCTCAGCAGGATTTTCACCAACAGCCTGATGTTCGGCACCTCGGACCGTGTCTATGAGAAACTGAAGGACCTGGAAGAGGGCATCCAGGCTCTGATGCAGGTGAGGATGGACTAGCCTGGGGTTATGCCTGGAGCCTAGGTGGGGCTCACTGTCCTCTGTTTTACCGGTCAGCCCTTAGACCCTTGAGAAGGCTTCTTCTTCTTCATTTTCCTTTATGAAGCCTCCAGGCTTTTCCTTCGGTCCTGGGGTGGAGGGAGGCACAGCTCCCGAGTCTCCTGCCCTTCTTTCCCACGACAGGAGCTGGAAGATGGCAGCCCCCGTGTTGGGCAGATCCTCAAGCAAACCTATGACAAGTTTGACGCCAACATGCGCAGCGACGACGCGCTGCTCAAAAACTATGGGCTGCTCTCCTGCTTCAAGAAGGACCTGCACAAAGCGGAGACCTACCTGCGGGTCATGAAGTGTCGCCGCTTTGTGGAAAGCAGCTGTGCCTTCTAGCCACTCACCAGTGTCTCTGCTGCACTCTCCTGTGCCTCCCTGCCCCCTGGCAACTGCCACCCCGCGCTTTGTCCTAATAAAATTAAGATGCATCATATCACCCGGCTAGAGGTCTTTCTGTTATGGGATGGAGCAGTTGTGTCAATCTTGTTCCTGGAAGCCTGCGAGAA
sequence alignment: why?
Early in the days of protein and gene sequence analysis, it was discovered that the sequences from related proteins or genes were similar, in the sense that one could align the sequences so that many corresponding residues match.
This discovery was very important: strong similarity between two genes is a strong argument for their homology. Bioinformatics is based on it.
sequence alignment: why?
Terminology:
Homology means that two (or more) sequences have a common ancestor. This is a statement about evolutionary history.
Similarity simply means that two sequences are similar, by some criterion. It does not refer to any historical process, just to a comparison of the sequences by some method. It is a logically weaker statement.
However, in bioinformatics these two terms are often confused and used interchangeably. The reason is probably that significant similarity is such a strong argument for homology.
two protein alignment
many genes have a common ancestor
The basis for comparison of proteins and genes using the similarity of their sequences is that the proteins or genes are related by evolution; they have a common ancestor.
Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently.
Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.
definition of sequence alignment
Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
There are two types of alignment: local and global. In global alignment, an attempt is made to align the entire sequence. If two sequences have approximately the same length and are quite similar, they are suitable for the global alignment.
Local alignment concentrates on finding stretches of sequences with high level of
definition of sequence alignment
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
interpretation of sequence alignment
Sequence alignment is useful for discovering structural, functional and evolutionary information.
Sequences that are very much alike may have similar secondary and 3D structure, similar function and likely a common ancestral sequence. It is extremely unlikely that such sequences obtained their similarity by chance. For DNA molecules with n nucleotides this probability is very low: $P = 4^{-n}$. For proteins the probability is even lower: $P = 20^{-n}$, where n is the number of amino acid residues
Large scale genome studies revealed existence of horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species
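To get a feel for how quickly these chance probabilities shrink, the sketch below evaluates them for a few lengths. It assumes, as the formulas do, that every residue is equally likely and independent of its neighbours; the function name `chance_identity` is hypothetical.

```python
from fractions import Fraction


def chance_identity(n: int, alphabet_size: int) -> Fraction:
    """Probability that a random sequence of length n equals a given one,
    assuming equally likely, independent residues."""
    return Fraction(1, alphabet_size) ** n


if __name__ == "__main__":
    for n in (5, 10, 20):
        dna = float(chance_identity(n, 4))       # 4 nucleotides
        protein = float(chance_identity(n, 20))  # 20 amino acids
        print(f"n={n:2d}  DNA: {dna:.3e}  protein: {protein:.3e}")
```

Real sequences have unequal residue frequencies, so these numbers are best read as an order-of-magnitude guide.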
Dot matrix analysis
The dynamic programming (DP) algorithm
Word or k-tuple methods
methods of sequence alignment
dot matrix analysis
A dot matrix analysis is a method for comparing two sequences to look for possible alignment (Gibbs and McIntyre 1970)
One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side
Starting from the first character in B, one moves across the page, keeping to the first row, and places a dot in every column where the character in A is the same
The process is continued until all possible comparisons between A and B are made
Any region of similarity is revealed by a diagonal row of dots
Isolated dots not on diagonal represent random matches
Detection of matching regions can be improved by filtering out random matches and this can be achieved by using a sliding window
This means that instead of comparing single sequence positions, several positions are compared at the same time, and a dot is printed only if a certain minimal number of matches occurs
Dot matrix analysis can also be used to find direct and inverted repeats within the sequences
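The procedure described above, including the sliding-window filter, can be sketched in a few lines. The function names, the default parameters and the text rendering below are illustrative assumptions, not taken from any particular dot-plot program.

```python
def dot_matrix(a: str, b: str, window: int = 1, min_matches: int = 1) -> list[list[bool]]:
    """dots[i][j] is True when the windows starting at b[i] and a[j]
    agree in at least min_matches of their window positions."""
    rows = len(b) - window + 1
    cols = len(a) - window + 1
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(a[j + k] == b[i + k] for k in range(window))
            row.append(matches >= min_matches)
        dots.append(row)
    return dots


def render(a: str, b: str, dots: list[list[bool]]) -> str:
    """Draw the matrix with sequence A across the top and B down the side."""
    if not dots:
        return ""
    lines = ["  " + a[:len(dots[0])]]
    for i, row in enumerate(dots):
        lines.append(b[i] + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "GATTACA"  # made-up sequence compared with itself
    print(render(seq, seq, dot_matrix(seq, seq)))
    print()
    print(render(seq, seq, dot_matrix(seq, seq, window=3, min_matches=3)))
```

With `window=1` the plot of a sequence against itself shows the main diagonal plus isolated random matches; requiring three matches in a window of three removes the isolated dots and leaves only the diagonal.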
dot matrix analysis
Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
dot matrix analysis: two identical sequences
Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse
dot matrix analysis: two very different sequences
Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)
dot matrix analysis: two similar sequences
dynamic programming
back to the basics
DNA/RNA sequences: strings composed of an alphabet of 4 letters
Protein sequences: alphabet of 20 letters
why do we do it?
Identify a gene
Find clues to gene function (ortholog?)
Find other organisms with this gene (homology)
Gather info for an evolutionary model
…
alignment
alignment is the basis for finding similarity
Pairwise alignment = dynamic programming
Multiple alignment: protein families and functional domains
Multiple alignment is "impossible" for lots of sequences
Another heuristic: progressive pairwise alignment
an example
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
possible alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
alignment
three elements
Perfect matches
Mismatches
Insertions & deletions (indels)
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
significant similarity in an alignment
H0: the current alignment is a result of random line-up (the 2 sequences are unrelated)
Ha: the sequences diverged from a common ancestor (related)
Test statistic: Ymax = length of the longest perfectly matching subsequence
exact matching subsequences
In DNA alignment, the matching probability is
$$p_{match} = p_a^2 + p_c^2 + p_g^2 + p_t^2$$
Under H0, the lengths Y of exact-match subsequences should follow a geometric distribution
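As a small numerical illustration, the sketch below computes the matching probability from base frequencies and the resulting chance of long exact-match runs. It assumes the geometric distribution is parameterised as "Y matches followed by one mismatch", so that P(Y = y) = p^y (1 - p); the function names are hypothetical.

```python
def match_probability(freqs: dict[str, float]) -> float:
    """Probability that two independent random bases are identical:
    the sum of squared base frequencies."""
    return sum(p * p for p in freqs.values())


def run_length_pmf(y: int, p: float) -> float:
    """P(Y = y): exactly y matches, then a mismatch (assumed parameterisation)."""
    return p ** y * (1.0 - p)


def run_at_least(y: int, p: float) -> float:
    """P(Y >= y): the first y positions all match."""
    return p ** y


if __name__ == "__main__":
    uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
    p = match_probability(uniform)
    print("match probability:", p)
    print("P(run of at least 10):", run_at_least(10, p))
```

With equal base frequencies the matching probability is 0.25, and the chance of a run of ten or more exact matches starting at a given position is already below one in a million.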
well-matching subsequences
Evolution may cause small differences to even sequences with a reasonably recent common ancestor.
We consider Ymax to be the longest subsequence with up to k mismatches.
Y follows a hyper-geometric distribution
P-value: exact/simulated/approximate (independence among Y does not hold any more)
choosing alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
Score: used to determine the quality of a match and as the basis for the selection of matches. Scores are relative.
Expectation value (E-value): an estimate of the likelihood that a given hit is due to pure chance, given the size of the database; should be as low as possible. E-values are absolute. A high score and a low E-value indicate a true hit.
Sequence identity (%) (or similarity): number of matched residues divided by the total length of the probe
scoring
scoring rule
Example Score = (# matches) – (# mismatches) – (# indels) x 2
examples
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
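The example scoring rule is easy to apply mechanically. The sketch below scores two already-aligned rows column by column, using +1 for a match, -1 for a mismatch and -2 for an indel as in the rule above; the function name `alignment_score` is hypothetical.

```python
def alignment_score(row1: str, row2: str, match: int = 1,
                    mismatch: int = -1, indel: int = -2) -> int:
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel       # insertion or deletion
        elif a == b:
            score += match       # perfect match
        else:
            score += mismatch    # substitution
    return score


if __name__ == "__main__":
    # First example alignment: 13 matches, 2 mismatches, 4 indels.
    print(alignment_score("-GCGC-ATGGATTGAGCGA",
                          "TGCGCCATTGAT-GACC-A"))
```

Because the score is a sum over columns, changing the three weights is enough to try a different scoring rule on the same alignment.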
edit distance
The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other
Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance
$$d(s_1, s_2) = \max_{\text{alignment of } s_1 \,\&\, s_2} \text{score}(\text{alignment})$$
computing edit distance
How can we compute the edit distance? If |s| = n and |t| = m, there are more than $\binom{n+m}{m}$ alignments
2 sequences each of length 1000: > 10^600
The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently
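A quick check of the size of the search space, assuming the lower bound on the number of alignments is the binomial coefficient C(n+m, m). The function name is illustrative.

```python
import math


def alignment_count_lower_bound(n, m):
    """Binomial coefficient C(n+m, m), taken here as the lower bound."""
    return math.comb(n + m, m)


count = alignment_count_lower_bound(1000, 1000)
print(count > 10**600)  # True: far too many alignments to enumerate
```

This is why the recursive argument that follows matters: it replaces enumeration with a table of (n+1) x (m+1) entries.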
recursive argument
Suppose we have two sequences: s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases, where σ(a, b) denotes the score of aligning a with b:
1. Last position is (s[n+1], t[m+1]):
$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$
2. Last position is (s[n+1], -):
$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$
3. Last position is (-, t[m+1]):
$d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$
Define the notation:
$V[i, j] = d(s[1..i], t[1..j])$
Using the recursive argument, we get the following recurrence for V, with σ the per-column score:
$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$
Of course, we also need to handle the base cases in the recursion:
$V[0, 0] = 0$
$V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -)$
$V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$
We fill the matrix using the recurrence rule, with s = AAAC indexing the rows (i = 0..4) and t = AGC indexing the columns (j = 0..3)
dynamic programming algorithm
          A   G   C
      0   1   2   3
  0   0  -2  -4  -6
A 1  -2   1  -1  -3
A 2  -4  -1   0  -2
A 3  -6  -3  -2  -1
C 4  -8  -5  -4  -1
Conclusion: d(AAAC,AGC) = -1
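The matrix fill can be sketched in Python as below. The column scores (+1 match, -1 mismatch, -2 indel) are the ones from the earlier scoring example, which reproduce the table; the names `sigma` and `fill_matrix` are illustrative. Python strings are 0-based, so `s[i]` in the code plays the role of s[i+1] in the recurrence.

```python
def sigma(a, b):
    """Column score: +1 match, -1 mismatch, -2 when either side is a gap."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    """Return V with V[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against nothing is all indels.
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    # Recurrence: best of substitution/match, deletion, insertion.
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),
                V[i][j + 1] + sigma(s[i], "-"),
                V[i + 1][j] + sigma("-", t[j]),
            )
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
print(V[-1][-1])  # -1, the value of d(AAAC, AGC)
```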
dynamic programming algorithm
interpretation of pointers
Horizontal pointer: insertion of S2(j) into S1
Vertical pointer: deletion of S1(i) from S1
Diagonal pointer: match or substitution
We now trace back the path that corresponds to the best alignment
AAAC
AG-C
reconstructing the best alignment
Sometimes, more than one alignment has the best score
AAAC
A-GC
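A sketch of the traceback that follows every pointer consistent with the recurrence, so that all best-scoring alignments are listed. It assumes the same column scores as the worked example (+1, -1, -2); function names are illustrative. Listing all optimal alignments can take exponential time in general, so this is only meant for small inputs.

```python
def sigma(a, b):
    """Column score: +1 match, -1 mismatch, -2 when either side is a gap."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    """V[i][j] = best score for aligning s[:i] with t[:j]."""
    V = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, len(t) + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
    return V


def optimal_alignments(s, t):
    """Return every (row_s, row_t) pair that reaches the best score."""
    V = fill_matrix(s, t)
    results = []

    def walk(i, j, top, bottom):
        # Rows are built back to front and reversed at the origin.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, top + s[i - 1], bottom + t[j - 1])  # match or substitution
        if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            walk(i - 1, j, top + s[i - 1], bottom + "-")  # deletion from s
        if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
            walk(i, j - 1, top + "-", bottom + t[j - 1])  # insertion of t[j-1]

    walk(len(s), len(t), "", "")
    return results


for top, bottom in optimal_alignments("AAAC", "AGC"):
    print(top)
    print(bottom)
    print()
```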
complexity
Space: O(mn)
Time: O(mn)
Filling the matrix: O(mn)
Backtrace: O(m+n)
other scoring schemes
Needleman and Wunsch: 1 for identical amino acid, 0 otherwise
Dayhoff PAM scoring matrix: variations include BLOSUM matrices (Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919).
… Different Gap Cost Function
scoring matrix for protein sequences
substitution “log odds” matrix BLOSUM 62
Henikoff and Henikoff (1992; PNAS 89:10915-10919)
( M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 5).
PAM 250 matrix
multiple sequence alignment
Often a probe sequence will yield many hits in a search. Then we want to know which are the residues and positions that are common to all or most of the probe and match sequences
In multiple sequence alignment, all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the 'same' position in each sequence.
name of homologous domains
position of residue
residues and positions common to most homologs
consensus
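A small Python sketch of how a consensus row could be read off such a table: each column of the alignment is inspected and the most common residue is kept. The toy rows, the 0.5 threshold and the '.' placeholder are assumptions made for illustration, not part of the slides.

```python
from collections import Counter


def consensus(rows, min_fraction=0.5):
    """Most common residue per column; '.' where the column is mostly gaps
    or no residue reaches min_fraction of the rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(rows) >= min_fraction:
            out.append(residue)
        else:
            out.append(".")
    return "".join(out)


# Toy alignment (made up): one row per sequence, one column per position.
rows = ["ACG-T", "ACGAT", "TCG-T"]
print(consensus(rows))  # ACG.T
```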
an example
cellulose-binding domain of cellobiohydrolase
why multiple sequence alignment?
Identify consensus segments, hence the most conserved sites and residues
Use for construction of phylogenies (of genes, strains, organisms, species, life): convert similarity to distance
www.ch.embnet.org/software/ClustalW.html
sequence logo
This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position.
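One common way to compute the information content of a column is the maximum possible entropy minus the observed Shannon entropy, in bits. The sketch below uses that definition and assumes a 4-letter DNA alphabet; it ignores gaps and the small-sample correction that logo software usually applies. The function name is illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet_size) minus column entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


print(column_information("AAAA"))  # 2.0: fully conserved position
print(column_information("AACC"))  # 1.0: two residues, equally frequent
print(column_information("ACGT"))  # 0.0: no conservation
```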
sample multiple alignment
constructing the tree of life with k-mers (16S rRNA, 35 organisms)
Figure labels: Eukarya, Bacteria, A. aeolicus, T. maritima, Archaea
Black tree: distribution of 8-mers. Red tree: sequence alignment.
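An alignment-free comparison of this kind can be sketched by turning each sequence into a k-mer frequency profile and measuring the distance between profiles. The slides do not say which distance was used; the Euclidean distance below is an assumption, and the function names are illustrative. The slides use k = 8; the toy call uses k = 2 to keep the example small.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of every k-mer that occurs in seq."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


a = kmer_profile("ACGTACGTACGT", 2)
b = kmer_profile("ACGTACGAACGT", 2)
c = kmer_profile("TTTTTTTTTTTT", 2)
print(profile_distance(a, b))  # small: the sequences differ at one base
print(profile_distance(a, c))  # larger: no shared 2-mers
```

A matrix of such pairwise distances is the usual input to a distance-based tree-building method.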
databases of multiple alignments
Pfam: protein families database of alignments and HMMs, www.cgr.ki.se
PRINTS: multiple motifs consisting of ungapped, aligned segments of sequences, which serve as fingerprints for a protein family, www.bioinf.man.ac.uk
BLOCKS: multiple motifs of ungapped, locally aligned segments created automatically, fhcrc.org
software
manual alignment software
GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet, available from: http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from: http://iubio.bio.indiana.edu
SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
BLAST
Search a sequence database for fragments similar to the query sequence
1. Compile a list of high-scoring short words shared by the query sequence and the database;
2. Scan the database for “hits”;
3. Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)
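The three steps can be illustrated with a heavily simplified seed-and-extend sketch in Python. It is not the BLAST implementation: words are matched exactly rather than scored against a neighbourhood, there are no statistics, and the word size `w`, the scores and the `drop` cut-off are hypothetical parameters chosen for the example.

```python
def find_seeds(query, subject, w):
    """Steps 1-2: index the query words, then scan the subject for exact hits."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            yield i, j


def extend(query, subject, i, j, w, match=1, mismatch=-1, drop=3):
    """Step 3: extend a hit without gaps in both directions, stopping once
    the running score falls `drop` below the best score seen so far."""

    def grow(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = grow(i + w, j + w, +1)
    left_score, left_len = grow(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, j - left_len, left_len + w + right_len


def best_segment_pair(query, subject, w=3):
    """Return (score, query_start, subject_start, length) of the best extension."""
    best = None
    for i, j in find_seeds(query, subject, w):
        candidate = extend(query, subject, i, j, w)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


# Toy sequences (made up) sharing the segment ACGTACGT at offset 2.
print(best_segment_pair("GGACGTACGTCC", "TTACGTACGTAA"))  # (8, 2, 2, 8)
```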
BLAST
Altschul et al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
Variations of BLAST designed for specific purposes http://www.ncbi.nlm.nih.gov/BLAST/
similarity searching the databanks
What is similar to my sequence?
Searching gets harder as the databases get bigger - and quality degrades
Tools: BLAST and FASTA = time saving heuristics (approximate)
Statistics + informed judgement of the biologist
read out
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
structure - function relationships
Can we predict the function of protein molecules from their sequence?
sequence > structure > function
Conserved functional domains = motifs
Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)
protein domains
(from ProDom database)
DNA sequencing
Automated sequencers > 40 kb per day
500 bp reads must be assembled into complete genes
errors, especially insertions and deletions
error rate is highest at the ends, where we want to overlap the reads
vector sequences must be removed from ends
Faster sequencing relies on better software
overlapping deletions vs. shotgun approaches: TIGR
DNA sequencing
finding genes in genome sequence is not easy
About 2% of human DNA encodes functional genes.
Genes are interspersed among long stretches of non-coding DNA.
Repeats, pseudo-genes, and introns confound matters
pattern finding tools
It is possible to use DNA sequence patterns to predict genes: promoters, translational start and stop codons (ORFs), intron splice sites, codon bias
Can also use similarity to known genes/ESTs
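As an illustration of the simplest of these patterns, the Python sketch below scans for open reading frames: an ATG start followed in the same frame by a stop codon. It looks at the forward strand only and ignores introns, promoters and codon bias, so it is a toy, not a gene finder. The function name and the `min_codons` parameter are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, for each
    ATG ... stop stretch found in the three forward reading frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


dna = "CCATGAAACCCTAGGG"  # made-up sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 14 ATGAAACCCTAG
```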
phylogenetics
Evolution = mutation of DNA (and protein) sequences
Can we define evolutionary relationships between organisms by comparing DNA sequences?
is there one molecular clock?
phenetic vs. cladistic approaches
lots of methods and software, what is the "correct" analysis?
phylogenetics
software tools on the web
Many of the best tools are free over the Web: BLAST, ENTREZ/PUBMED, protein motifs databases
Bioinformatics “service providers”: DoubleTwist™, Celera, BioNavigator™
Hodgepodge collection of other tools: PCR primer design, pairwise and multiple alignment
PC programs
Macintosh and Windows applications
- Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™
- Freeware: Phylip, Fasta, Clustal, etc.
Better graphics, easier to use
Can't access very large databases or perform demanding calculations
Integration with web databases and computing services
Vector NTI
most important sequence databases
GenBank - maintained by the USA National Center for Biotechnology Information (NCBI)
All biological sequences: www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Genomes: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
Swiss-Prot - maintained by EMBL - European Bioinformatics Institute (EBI)
Protein sequences: www.ebi.ac.uk/swissprot/
genome project
the human genome project
The genome sequence is complete - almost! Approximately 3.2 billion base pairs.
Any human gene can now be found in the genome by similarity searching with over 99% certainty.
However, the sequence still has many gaps
hard to find an uninterrupted genomic segment for any gene
still can’t identify pseudogenes with certainty
This will improve as more sequence data accumulates
all the genes
Raw Genome Data: example of the code
The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!
example of the code
inconsistency
Celera says that there are only ~34,000 genes, so why are there ~60,000 human genes on Affymetrix GeneChips?
Why does GenBank have 49,000 human gene coding sequences and UniGene have 96,000 clusters of unique human ESTs?
Clearly we are in desperate need of a theoretical framework to go with all of this data
http://www.celera.com/
implications for biomedicine
Physicians will use genetic information to diagnose and treat disease.
Virtually all medical conditions have a genetic component.
Faster drug development research
Individualized drugs
Gene therapy
All Biologists will use gene sequence information in their daily work
the equipment
meaning of the code …
meaning of the code …
evolution
how do genomes evolve?
Point mutations
Rearrangements
Recombination
Selection and Drift
how can you view the evolution?
Individual gene alignment view (usually proteins)
Dot plot or VISTA (local similarity) view
Synteny view
Composite (average) views
DNA dot plot view
Show one DNA along X-axis, second on Y-axis
For every position along both, score local similarity
Display 2-D plot of similarity in gray-scale
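These three steps can be sketched in a few lines of Python. Local similarity is taken here to be the fraction of identical bases in a short window, which is one simple choice among several; the window size, the function names and the text "gray scale" are assumptions made for the example.

```python
def dot_plot(x_seq, y_seq, window=3):
    """matrix[y][x] = fraction of identical bases in the windows that
    start at position x of x_seq and position y of y_seq."""
    rows = []
    for y in range(len(y_seq) - window + 1):
        row = []
        for x in range(len(x_seq) - window + 1):
            matches = sum(x_seq[x + k] == y_seq[y + k] for k in range(window))
            row.append(matches / window)
        rows.append(row)
    return rows


def render(matrix, shades=" .:#"):
    """Crude gray scale: darker characters mean higher local similarity."""
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[round(v * top)] for v in row) for row in matrix
    )


# Self-comparison of a sequence containing a tandem duplication of ACGT.
seq = "ACGTACGTTTGA"
print(render(dot_plot(seq, seq)))
```

In a self-comparison the main diagonal is the self-match, and the shorter parallel diagonals come from the duplicated ACGT.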
self-match
tandem duplication
dot plot example
random dot plot
promoter conservation
gene structure revealed by dot plot
synteny view
Synteny definition: a contiguous region in another genome that has more-or-less the same genes in the same order.
The boundaries of what constitutes synteny are a bit fuzzy… for example you probably wouldn’t say a region isn’t syntenic if it is missing one gene out of many.
single inversion
insertion or deletion
double inversion
intra-chromosomal rearrangements
inter-chromosomal rearrangements
syntenic scaling
These regions are perfectly syntenic, but on average the mouse has shorter regions separating alignable conserved blocks.
limitations to synteny view
Provides only overview of arrangement, with no information about the degree or areas of conservation.
As genomes become more distant synteny becomes more chaotic, until (in the extreme) most blocks are one gene long (e.g. flies vs. human).
In some cases, very deep synteny can be seen, most dramatically in the Hox clusters.
Hox cluster
composite or summary views
View comparative summaries that encapsulate general properties of the genome.
For example, G-C content comparison:
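A minimal Python sketch of how G-C content could be computed for such a comparison, either for a whole sequence or in fixed windows along it. The function names, window size and step are illustrative; characters other than A, C, G and T are ignored.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def windowed_gc(seq, window, step):
    """G-C content of consecutive windows, for plotting along a genome."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGC"))               # 0.5
print(windowed_gc("GGCCAATT", 4, 4))    # [1.0, 0.0]
```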
phylogenetics and evolution