Bioinformatics. introduction molecular biology biotechnology bioMEMS bioinformatics ...


Page 1: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

bioinformatics

Page 2:

introduction, molecular biology, biotechnology, bioMEMS, bioinformatics, bio-modeling, cells and e-cells, transcription and regulation, cell communication, neural networks, DNA computing, fractals and patterns, the birds and the bees ... and ants

course layout

Page 3:

book

Introduction to Computational Molecular Biology

Page 4:

introduction

Page 5:

DNA

Page 6:

DNA

Page 7:

central dogma

Page 8:

definitions

Informatics: the science of information management

Bioinformatics: the science of biological information management

Page 9:

what is bioinformatics?

Page 10:

Bioinformatics is Multidisciplinary

Computer Science
Math
Statistics
Structural Biology
Phylogenetics
Drug Design
Genomics
Molecular Biology

interdisciplinary

Page 11:

increasing levels of complexity

Genome (DNA)
Transcriptome (RNA)
Proteome (proteins)
Metabolome (metabolic pathways)

Page 12:

GenBank basepair growth (millions of base pairs; source: GenBank)

year   millions of bp
1982   1
1983   2
1984   3
1985   5
1986   10
1987   16
1988   24
1989   35
1990   49
1991   72
1992   101
1993   157
1994   217
1995   385
1996   652
1997   1,160
1998   2,009
1999   3,841

growth of biological databases
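The GenBank figures on this slide imply roughly exponential growth. As a minimal sketch (the function name `doubling_time` is an illustrative assumption, and the data are the chart values in millions of base pairs), the average doubling time over 1982 to 1999 can be estimated from the first and last data points:

```python
import math

# GenBank size in millions of base pairs for 1982..1999 (values from the chart).
YEARS = list(range(1982, 2000))
MEGABASES = [1, 2, 3, 5, 10, 16, 24, 35, 49, 72, 101, 157, 217,
             385, 652, 1160, 2009, 3841]


def doubling_time(years, sizes):
    """Average doubling time in years, assuming exponential growth
    between the first and the last data point."""
    span = years[-1] - years[0]
    growth_factor = sizes[-1] / sizes[0]
    return span * math.log(2) / math.log(growth_factor)


if __name__ == "__main__":
    print(f"doubling time: {doubling_time(YEARS, MEGABASES):.2f} years")
```

Using only the two end points keeps the arithmetic transparent; a least-squares fit on the logarithm of all eighteen values would be a more robust alternative.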

Page 13:

growth of biological databases

3D structures growth: http://www.rcsb.org/pdb/holdings.html

Page 14:

symbol  meaning        explanation
G       G              Guanine
A       A              Adenine
T       T              Thymine
C       C              Cytosine
R       A or G         puRine
Y       C or T         pYrimidine
N       A, C, G or T   aNy base
U       U              Uracil

DNA/RNA
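The ambiguity symbols on this slide map naturally onto regular-expression character classes. The sketch below is illustrative only: the helper names `iupac_to_regex` and `find_matches` are assumptions, and only the symbols listed on the slide are covered.

```python
import re

# Symbols from the slide, mapped to regex character classes.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "[AG]",    # puRine
    "Y": "[CT]",    # pYrimidine
    "N": "[ACGT]",  # aNy base
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'RAY' into a compiled regex."""
    return re.compile("".join(IUPAC[symbol] for symbol in pattern.upper()))


def find_matches(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    regex = iupac_to_regex(pattern)
    return [m.start() for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_matches("RAY", "GACAATTT"))
```

An unknown symbol raises `KeyError`, which is a deliberate choice here so that typos in a pattern are not silently ignored.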

Page 15:

some definitions
use of computers to catalog and organize molecular life science information into meaningful entities
subset of computational biology
methods to analyse, store, search, retrieve and represent biological data by computers / in computers
massive amounts of data: databases
extracting information and knowledge from "raw" data
for most bioscientists, all they need in bioinformatics is sequence analysis

definitions of bioinformatics

Page 16:

bioinformatics is not just the storage of data in a computer.

bioinformatics is the use of computers to test a biological hypothesis prior to performing the experiment in the laboratory.

bioinformatics is the design of software programs that analyse data.

what does it do?

Page 17:

nucleotide and protein sequences
protein structures
all sorts of functional data related to genes, proteins and their regulation, interactions etc.
curated and non-curated databases

bioinformatics databases

Page 18:

sequence searching and sequence alignments
looking at properties that can be analyzed/predicted from sequence data
protein structures and their analysis
structural classification
visualisation of macromolecules
"system-wide" understanding of the biology of a given organism

some goals

Page 19:

genomes and their annotation

complete genomes of many organisms are available
seeing "parts lists" of everything an organism needs and figuring out how they work together
annotation: looking at the DNA sequence

Page 20:

genomes and their annotation

gene finding is not always straightforward
problem: rare gene products, for which you cannot find corresponding mRNA or protein sequences in databanks
additional complication: alternative splicing, many transcripts per gene

Page 21:

genomes and their annotation

if you intend to analyze or just use data from a databank, it is useful to know both the goals and the reality of their annotation level
inconsistencies, missing data
even well-annotated databanks provide only a fraction of all biologically relevant information on a gene or a molecule (compared to literature)

Page 22:

annotation: a vision

databank content: all knowledge on functions of a gene product
add structural information: insights in structure-function relationships
add data on expression patterns and regulation: understanding cell differentiation and other big questions in biology on molecular level

Page 23:

current -omics

Page 24:

metabolomics

“…to identify, measure and interpret the complex time-related concentration, activity and flux of metabolites in cells, tissues, and other bio-samples such as blood, urine, and saliva.”

Page 25:

systems biology

Integrated view of biology at multiple levels

Generation of quantitative, predictive models of the behavior of biological systems, such as organisms

Page 26:

bioinformatics in short

very short

Page 27:

common genes?

Page 28:

Application of information technology to the storage, management and analysis of biological information

Facilitated by the use of computers

what is bioinformatics?

Page 29:

what is bioinformatics?

Sequence analysis: geneticists/molecular biologists analyse genome sequence information to understand disease processes
Molecular modeling: crystallographers/biochemists design drugs using computer-aided tools
Phylogeny/evolution: geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences
Ecology and population studies: bioinformatics is used to handle large amounts of data obtained in population studies
Medical informatics
Personalised medicine

Page 30:

sequence analysis: overview

[Flowchart; box labels grouped by branch]

Sequence entry: manual sequence entry; sequence database browsing; sequencing project management

Nucleotide sequence analysis (from a nucleotide sequence file): search databases for similar sequences; sequence comparison; multiple sequence analysis; design further experiments (restriction mapping, PCR planning); search for known motifs; RNA structure prediction (non-coding); search for protein coding regions (coding); translate into protein

Protein sequence analysis (from a protein sequence file): search databases for similar sequences; sequence comparison; search for known motifs; predict secondary structure; predict tertiary structure; create a multiple sequence alignment; edit the alignment; format the alignment for publication; molecular phylogeny; protein family analysis
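The "search for protein coding regions" and "translate into protein" steps of the overview can be sketched in a few lines. This is a simplified illustration, not the method of any particular package: it assumes the standard genetic code, scans only the forward strand, and the names `translate` and `find_orfs` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of ATG...stop reading frames on the
    forward strand, checking all three frames; end is exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    seq = "CCATGGCCTAAGG"
    for start, end in find_orfs(seq):
        print(start, end, translate(seq[start:end]))
```

Real gene finders also scan the reverse complement and, for eukaryotes, have to model introns, which is why the slides stress that gene finding is not always straightforward.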

Page 31:

gene sequencing

Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences

Page 32:

Sequences producing significant alignments:                          (bits)  Value

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]         112  7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...      106  5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...       69  7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]          30  0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]          29  1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]          29  1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288

The BLAST program was written to allow rapid comparison of a new gene sequence with the hundreds of thousands of gene sequences in databases

database similarity searching

Page 33:

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

sequence comparison

Gene sequences can be aligned to see similarities between genes from different sources
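A pairwise comparison like the one on this slide rests on dynamic programming. The sketch below computes a Needleman-Wunsch global alignment score; it is a swapped-in textbook technique rather than the program that produced the slide's output, and the scoring values (match +1, mismatch -1, linear gap -1) are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b.

    Only the previous row of the dynamic-programming matrix is kept,
    so memory use is proportional to len(b)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Recovering the alignment itself, with its gap positions, would additionally require keeping the full matrix and tracing back from the last cell.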

Page 34:

restriction mapping

Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes

enzyme     sites  recognition sequence
AceIII     1      CAGCTCnnnnnnn'nnn...
AluI       2      AG'CT
AlwI       1      GGATCnnnn'n_
ApoI       2      r'AATT_y
BanII      1      G_rGCy'C
BfaI       2      C'TA_G
BfiI       1      ACTGGG
BsaXI      1      ACnnnnnCTCC
BsgI       1      GTGCAGnnnnnnnnnnn...
BsiHKAI    1      G_wGCw'C
Bsp1286I   1      G_dGCh'C
BsrI       2      ACTG_Gn'
BsrFI      1      r'CCGG_y
CjeI       2      CCAnnnnnnGTnnnnnn...
CviJI      4      rG'Cy
CviRI      1      TG'CA
DdeI       2      C'TnA_G
DpnI       2      GA'TC
EcoRI      1      G'AATT_C
HinfI      2      G'AnT_C
MaeIII     1      'GTnAC_
MnlI       1      CCTCnnnnnn_n'
MseI       2      T'TA_A
MspI       1      C'CG_G
NdeI       1      CA'TA_TG
Sau3AI     2      'GATC_
SstI       1      G_AGCT'C
TfiI       2      G'AwT_C
Tsp45I     1      'GTsAC_
Tsp509I    3      'AATT_
TspRI      1      CAGTGnn'

[Restriction map figure; position scale 50, 100, 150, 200, 250]
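Finding where an enzyme cuts is a pattern search followed by an offset. The sketch below models EcoRI from the list above (recognition site GAATTC, top strand cut after the first G); the function names are hypothetical, the molecule is assumed to be linear, and only the top-strand cut is modelled.

```python
import re


def cut_positions(sequence, site="GAATTC", cut_offset=1):
    """0-based positions at which the top strand is cut.

    The defaults model EcoRI (G'AATTC): the cut falls one base
    into the recognition site."""
    # A lookahead is used so that overlapping sites are also reported.
    pattern = f"(?={re.escape(site)})"
    return [m.start() + cut_offset
            for m in re.finditer(pattern, sequence.upper())]


def digest(sequence, site="GAATTC", cut_offset=1):
    """Fragments produced by cutting a linear sequence at every site."""
    bounds = [0] + cut_positions(sequence, site, cut_offset) + [len(sequence)]
    return [sequence[start:end] for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(digest("AAGAATTCTT"))
```

Sites containing ambiguity symbols such as r, y or n would first have to be expanded into character classes before the search.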

Page 35:

PCR primer design

Oligonucleotides for use in the polymerase chain reaction can be designed using computer based programs

OPTIMAL primer length --> 20
MINIMUM primer length --> 18
MAXIMUM primer length --> 22
OPTIMAL primer melting temperature --> 60.000
MINIMUM acceptable melting temp --> 57.000
MAXIMUM acceptable melting temp --> 63.000
MINIMUM acceptable primer GC% --> 20.000
MAXIMUM acceptable primer GC% --> 80.000
Salt concentration (mM) --> 50.000
DNA concentration (nM) --> 50.000
MAX no. unknown bases (Ns) allowed --> 0
MAX acceptable self-complementarity --> 12
MAXIMUM 3' end self-complementarity --> 8
GC clamp how many 3' bases --> 0
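Several of the constraints listed on this slide can be checked with simple counting. The sketch below is a rough illustration: it uses the Wallace rule, Tm = 2(A+T) + 4(G+C), as a stand-in for whatever salt-corrected formula the slide's program applies, and the function names and default thresholds (copied from the parameter list) are assumptions. Self-complementarity checks are omitted.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_filters(primer, min_len=18, max_len=22, min_tm=57, max_tm=63,
                   min_gc=20.0, max_gc=80.0, max_ns=0):
    """Check length, Tm, GC% and the number of unknown bases (N)."""
    primer = primer.upper()
    return (min_len <= len(primer) <= max_len
            and primer.count("N") <= max_ns
            and min_tm <= wallace_tm(primer) <= max_tm
            and min_gc <= gc_percent(primer) <= max_gc)


if __name__ == "__main__":
    candidate = "ATGCGTACGTTAGCCTAGGA"
    print(gc_percent(candidate), wallace_tm(candidate), passes_filters(candidate))
```

The Wallace rule ignores salt and DNA concentration, so its values will not match a nearest-neighbour calculation exactly; it is used here only because it can be verified by hand.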

Page 36:

Alignment formatted using MacBoxshade

Sequences of proteins from different organisms can be aligned to see similarities and differences

multiple sequence alignment

Page 37:

E.coli

C.botulinum

C.cadavers

C.butyricum

B.subtilis

B.cereus

Phylogenetic tree constructed using the Phylip package

Analysis of sequences allows evolutionary relationships to be determined

phylogeny inference
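Distance-based tree building starts from pairwise distances between aligned sequences. The sketch below is not the Phylip implementation; it shows the proportion of differing sites (p-distance) and the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3), with hypothetical function names and made-up toy sequences.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    differences = sum(x != y for x, y in compared)
    return differences / len(compared)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed p-distance (p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(aligned):
    """Corrected distances for every pair in a {name: sequence} mapping."""
    names = list(aligned)
    return {(m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
            for i, m in enumerate(names) for n in names[i + 1:]}


if __name__ == "__main__":
    toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACCTAGGA"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```

A clustering method such as neighbour joining or UPGMA would then turn this matrix into a tree.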

Page 38:

large scale bioinformatics: genome projects

Mapping: identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
Sequencing: assembling clone sequence reads into large (eventually complete) genome sequences
Gene discovery: identifying coding regions in genomic DNA by database searching and other
Function assignment: using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
Data mining: searching for relationships and correlations in the information
Genome comparison: comparing different complete genomes to infer evolutionary history and genome rearrangements

Page 39:

genomics

Page 40:

introduction to DNA microarrays

massive data sets from simultaneous expression levels of thousands of genes
impossible to grasp directly by the human mind
methods are needed for finding meaningful results and patterns from the bulk of data

Page 41:

DNA microarray bioinformatics

data manipulation: normalization etc.
data clustering: genes which behave in a similar fashion
sample classification by profiles of predictive genes (e.g. cancer typing)
data mining: finding interpretation to clustering results
example: recognition of regulatory factor binding sites in coexpressed genes
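The normalization and clustering steps named on this slide can be illustrated with two small building blocks: log ratios centred on their median, and a Pearson correlation that says whether two genes "behave in a similar fashion". This is a minimal sketch with hypothetical function names; real pipelines use more elaborate normalization and a full clustering algorithm on top of the similarity measure.

```python
import math
from statistics import median


def log_ratios(sample, control):
    """log2(sample/control) for each spot; intensities must be positive."""
    return [math.log2(s / c) for s, c in zip(sample, control)]


def median_center(values):
    """Shift values so that their median becomes zero (simple normalization)."""
    m = median(values)
    return [v - m for v in values]


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    gene_a = [1.0, 2.0, 3.0, 4.0]
    gene_b = [2.1, 3.9, 6.2, 8.0]
    print(round(pearson(gene_a, gene_b), 3))
```

Genes whose profiles correlate strongly across experiments are candidates for the same cluster, and their upstream regions can then be searched for shared regulatory motifs.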

Page 42:

hierarchy of relationships:

genome

gene 1 gene 2 gene 3 gene X

protein 1 protein 2 protein 3 protein X

function 1 function 2 function 3 function X

basis of molecular biology

Page 43:

organism         genome size (bp)    genes
FERN             160,000,000,000
LUNGFISH         139,000,000,000
SALAMANDER       81,300,000,000
NEWT             20,600,000,000
ONION            18,000,000,000
GORILLA          3,523,200,000
MOUSE            3,454,200,000
HUMAN            3,400,000,000       31,000
Drosophila       137,000,000         13,500
C. elegans       96,000,000          19,000
Yeast            12,000,000          6,315
E. coli          5,000,000           5,361
smallest genome  ??????
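One way to read this table is that genome size and gene count do not scale together. As a small worked example (the dictionary and function names are illustrative assumptions; the numbers are the ones on the slide), gene density in genes per megabase can be computed for the organisms that list both values:

```python
# organism: (genome size in base pairs, number of genes), as given on the slide
GENOMES = {
    "Human": (3_400_000_000, 31_000),
    "Drosophila": (137_000_000, 13_500),
    "C. elegans": (96_000_000, 19_000),
    "Yeast": (12_000_000, 6_315),
    "E. coli": (5_000_000, 5_361),
}


def genes_per_megabase(size_bp, genes):
    """Gene density: number of genes per million base pairs."""
    return genes / (size_bp / 1_000_000)


if __name__ == "__main__":
    for organism, (size_bp, genes) in GENOMES.items():
        print(f"{organism:12s} {genes_per_megabase(size_bp, genes):8.1f} genes/Mb")
```

With these figures the bacterial genome comes out more than a hundred times denser in genes than the human genome.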

Page 44:

comparative genomics
whole-genome analyses
evolution studies
analyses of components in a "complete" system

functional genomics = inferring functions from data
expression patterns, gene regulation
sequence comparisons, homologue relationships
studies of gene variation, altered phenotypes

genomics

Page 45:

gene finding is not always straightforward
problem: rare gene products, for which you cannot find corresponding mRNA or protein sequences in databanks
additional complication: alternative splicing, many transcripts per gene
even well-annotated databanks provide only a fraction of all biologically relevant information on a gene or a molecule (compared to literature)

genomics


Page 48:

DNA array technology

Array type             Spot density (per cm2)  Probe   Target  Labeling
Nylon macroarrays      < 100                   cDNA    RNA     Radioactive
Nylon microarrays      < 5000                  cDNA    mRNA    Radioactive/Fluorescent
Glass microarrays      < 10,000                cDNA    mRNA    Fluorescent
Oligonucleotide chips  < 250,000               oligos  mRNA    Fluorescent

Page 49:

spotting robot

Page 50:

microarray expression analysis

Page 51:

microarray

Page 52:

photolithography

Page 53:

array terminology

Page 54:

70 mer vs 40 mer attachment

[Diagram labels: NH2, 70 mer, 40 mer, target]

Page 55:

microarray

Page 56:

microarray data

Page 57:

control mouse

a stressed mouse

RNA extraction

target labeling

image analysis

gene expression analysis

determination of expression levels

Page 58:

DNA micro-array

Page 59:

genomics

The application of high-throughput automated technologies to molecular biology.

The experimental study of complete genomes.

Page 60:

genomics technologies

Automated DNA sequencing
Automated annotation of sequences
DNA microarrays: gene expression (measure RNA levels), single nucleotide polymorphisms (SNPs)
Protein chips (SELDI, etc.)
Protein-protein interactions

Page 61:

cDNA spotted microarrays

Page 62:

Affymetrix gene chips

Page 63:

microarray data analysis

Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments

Page 64:

microarray data analysis

Page 65:

impact on bioinformatics

Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.

It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

Page 66:

proteomics

Page 67:

what is proteomics?

"The analysis of the entire protein complement expressed by a genome, or by a cell or tissue type."

Two most related technologies
2-D electrophoresis: separation of complex protein mixtures
Mass spectrometry: identification and structure analysis

Wasinger VC et al. Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis 16 (1995) 1090-1094

Page 68:

transcription

Page 69:

genomic DNA

Structure, Regulation, Information

Computers cannot determine which of these 3 roles DNA plays solely based on sequence... (although we would all like to believe they can)

Page 70:

introduction to proteomics

Definitions
Classical: restricted to large scale analysis of gene products involving only proteins
Inclusive: combination of protein studies with analyses that have genetic components such as mRNA, genomics, and yeast two-hybrid

Don't forget that the proteome is dynamic, changing to reflect the environment that the cell is in

Page 71:

1 gene is no longer equal to one protein
1 gene = how many proteins?

1 gene = 1 protein?

Page 72:

why proteomics?

Annotation of genomes, i.e. functional annotation
Genome + proteome = annotation
Protein function
Protein post-translational modification
Protein localization and compartmentalization
Protein-protein interactions
Protein expression studies

Differential gene expression is not the answer

Page 73:

types of proteomics

Protein expression: quantitative study of protein expression between samples that differ by some variable
Structural proteomics: goal is to map out the 3-D structure of proteins and protein complexes
Functional proteomics: to study protein-protein interactions, 3-D structures, cellular localization and PTMs in order to understand the physiological function of the whole proteome

Page 74:

introduction to proteomics

composition of the proteome depends on cell type, developmental phase and conditions

proteome analyses are still struggling to solve the "basic proteome" of different cells and tissues or limited changes under changing conditions or during processes

current methods can only "see" the most abundant proteins

Page 75:

expression proteomics = differential proteomics = 2D-PE + MS
interaction proteomics
functional proteomics = systematic perturbation or functional inactivation of proteins in a given environment
structural proteomics

proteomics

Page 76:

typically a combination of 2D protein electrophoresis and mass spectrometry

labour-intensive, not really "high-throughput" methods

more efficient "protein array" methods are emerging

proteomics experiments

Page 77:

bioinformatics in proteomics

Page 78:

High-throughput determination of the 3D structure of proteins
Goal: to be able to determine or predict the structure of every protein.
Direct determination: X-ray crystallography and nuclear magnetic resonance (NMR).
Prediction: comparative modeling, threading/fold recognition, ab initio

structural proteomics

Page 79:

To study proteins in their active conformation.
Study protein:drug interactions
Protein engineering

Proteins that show little or no similarity at the primary sequence level can have strikingly similar structures.

why structural proteomics?

Page 80:

FtsZ - protein required for cell division in prokaryotes, mitochondria, and chloroplasts.

Tubulin - structural component of microtubules - important for intracellular trafficking and cell division.

FtsZ and Tubulin have limited sequence similarity and would not be identified as homologous proteins by sequence analysis.

an example

Page 81:

Burns, R., Nature 391:121-123
Picture from E. Nogales

FtsZ and tubulin have little similarity at the amino acid sequence level

homologues

Page 82:

Yes! Proteins that have conserved secondary structure can be derived from a common ancestor even if the primary sequence has diverged to the point that no similarity is detected.

are FtsZ and tubulin homologues?

Page 83:

structure is function

Page 84:

protein structure

Imaging Experimental X-ray diffraction data

Predicting structure in silico from sequence

Page 85: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Make crystals of your protein, 0.3-1.0 mm in size. Proteins must be in an ordered, repeating pattern.

X-ray beam is aimed at crystal and data is collected. Structure is determined from the diffraction data.

X-ray crystallography

Page 86: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

http://www-structure.llnl.gov/Xray/101index.html

X-ray crystallography

Page 87: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Schmid, M. Trends in Microbiology, 10:s27-s31.

crystals

Page 88: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

X-ray crystallography

Protein must crystallize.
Need large amounts (good expression).
Protein must be soluble (many proteins, such as membrane proteins, aren't).
Need to have access to an X-ray beam.
Solving the structure is computationally intensive.
Time: it can take several months to years to solve a structure.

Efforts to shorten this time are underway to make this technique high-throughput.

Page 89: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

general process for proteomics research

Image Analysis

Digester

Spot picker

Gel hotel

Spotter

MS

2-D Gel

Page 90: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

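Adapted from the web page of Prof. Rong-Huay Juang (莊榮輝), Department of Microbiology and Biochemistry, National Taiwan University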

general process for proteomics research

Page 91: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

protein microarray

arrayIT TM

Page 92: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

protein microarray

arrayIT TM

G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

Page 93: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

what can protein microarrays do?

1. Protein / protein interaction
2. Enzyme / substrate interaction (transient)
3. Protein / small molecule interaction
4. Protein / lipid interaction
5. Protein / glycan interaction
6. Protein / Ab interaction

1. G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

2. H.Zhu et al, 2001 Science 293:2101

3. Ziauddin J and Sabatini DM, 2001 Nature 411:107

Page 94: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

protein microarrays (Antibody arrays)

Page 95: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

the real world

The true spot quality compared to a real experiment

Page 96: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

mobility of protein in an electric field

Mobility: Electrolytic molecules move in an electric field

$\text{Mobility} \propto \dfrac{[\text{Electric field (mV)}]\,[\text{Net charge of molecule}]}{[\text{Friction between molecules and matrix}]}$

Page 97: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

2-dim electrophoresis

Page 98: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Digest to peptide fragments, then MS analysis

2-D gel electrophoresis

First dimension: denaturing iso-electric focusing; separation according to the pI (the iso-electric point of the protein)

Second dimension: SDS-PAGE (Sodium Dodecyl Sulfate Poly-Acrylamide Gel Electrophoresis); separation according to the molecular weight

Page 99: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

result example

Page 100: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Ion source: substance to ion gas
Mass analysis: according to mass/charge (m/z)
Detection: femtomole to attomole

Ion source -> Ion separator -> Detector

mass spectrometry

Page 101: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

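[Figure: time-of-flight mass spectrometer schematic. A pulsed UV or IR laser (3-4 ns) hits a peptide mixture embedded in light-absorbing chemicals (matrix), producing a cloud of protonated peptide molecules. A strong electric field (accV) accelerates the ions through the Time Of Flight tube, under vacuum, to the detector.]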

principle of mass spec
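The labels in the schematic (strong electric field, accV, Time Of Flight tube, detector) can be tied together by a short energy-balance derivation. This is only a sketch under idealized assumptions: the symbols are introduced here for illustration and do not appear on the slide. An ion of mass $m$ and charge $ze$ starts at rest, is accelerated through a potential $V$ (the accV label), and then drifts through a field-free tube of length $L$.

```latex
\begin{aligned}
zeV &= \tfrac{1}{2} m v^{2}
  && \text{(electrical energy becomes kinetic energy)} \\
v &= \sqrt{\frac{2zeV}{m}} \\
t &= \frac{L}{v} = L \sqrt{\frac{m}{2zeV}}
  && \text{(drift time to the detector)}
\end{aligned}
```

Under these assumptions the flight time grows with $\sqrt{m/z}$, which is why arrival time at the detector can be converted into a mass/charge value.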

Page 102: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

[Figure: two instrument layouts. Linear Time Of Flight tube: ion source -> detector. Reflector Time Of Flight tube: ion source -> reflector -> detector. Axis label in both panels: time of flight.]

principle of mass spec

Page 103: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

typical result

Page 104: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Nuclear Magnetic Resonance Spectroscopy (NMR)

Can perform in solution. No need for crystallization.

Can only analyze proteins that are <300 aa. Many proteins are much larger.

Can't analyze multi-subunit complexes.

Proteins must be stable.

Page 105: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

structure modeling

Comparative modeling: modeling the structure of a protein that has a high degree of sequence identity with a protein of known structure. Must be >30% identity to have a reliable structure.

Threading/fold recognition: uses known fold structures to predict folds in primary sequence.

Ab initio: predicting structure from primary sequence data. Usually not as robust, and computationally intensive.

Page 106: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequence alignment

Page 107: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequence alignment

©Ken Howard, Scientific American, July 2000

Page 108: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequenceGATCAAACATTAAACATCCTGAGATCCAAAGGTAAGAGATCTAGCCACAGGGAGTGCTGGGGATTCGGGTCCTGGTGATCTTCACATGCTGACATAGCTCAGCCCTTTTTGGCCCTGGCTTTGTCCTGTTGTGGGCTTTCCCATCTGCAACCCATGCTCCTGGGCCATTTTCCTATGGGCCAGGGAAAACAAGATGGGGTGAAGGCACCCTTACATTTAGGGGCAAGACCTAGTACTCAGAAGGATTCAGAAACTGAAATAGCTGGGTGATACCACACAGGTGCTAGGGATAAGGGGCCTTGAGCCATGGACCATGGGAACTACAAAGCTGAAGGAGCTGCTGCCTCAGCAGAACCAGCGCTTGAATTTGTTCTTTCAGAACCTCAGTCTCTTCCTCTGAAAAATGGGTGTGTTGTGTATCCCACATTCCCAAGTCAGCCATGGGACCAAATGTGAGCGTGTGGGTTTTGCCTCCTGAGAAACTCAGGGGAGCAGAATGCTACAGTGGGTGAATTGGATTCTTTCAGAGAGCCCACCCTGTTTCCCACATCAGCCAGAAGGCTCAAAACCCTGAAGAGCTTTCTGAACTTTGAGGTGCCCAAAGCTTCAGGGCTGTATGGGAAGCACCTGAGGTCCAAGTCCGTTTACAAGAATTTTGTTTTTTGGTTTACAGCTGCTTGGCCGGTCCAAGGAGCAGGTTTGGGTCCTGTGCTCCACAGACCTAAGGGTTACCTTAGAGCTTATGGGAGAGCATTGTGTGTGGACAGTGGACAGTGCCCTCTAGTGCTCAGTGTTAGCACTACATCCAGTTGCCCTCCACCAGTTTATGCTGCTGAGGAAGTCTTTCTTTTCCCAACAGCAGTGTCTCTCCCTCTCCCACCCCCTCTCCCTCTCCCTCCCCCCCTAGGTTATTTTTATTTTTACTGGTGTGTATGTGTGTGAGTCTATGTCACATGTATGAGAGTGCTTGTGGAGACCAGAAGAGGGCATCAGAAGAGCCCCTAGAACTGGAGTATAGGTGGTTGTGAGCCACTTGTCATGGGTGTTGGGAACCAAACTCAGGTTCTCTGGAAGAACAACAAGCTCCCTTATCATATAAGCCATCTCTAAATCCAGGACATTTTTTTTTTTTTTTTTGAGATTTAGAGATTCAAGGAGGAGGAACAATAGGAGGAAGAAGGGGACAGAATAAGGCCAACAAAATGACCAAGGAGGTATAGGCACTTGAAGCCAAACCTAAGTACCTGAGTTCAATCCCTGGGACCCACATGATGGAAAGATGGAATCGATCCCCAAAAGTTATCTTCTGATCCCTATATGCACACACTTGAGGATGGACAGACAAAGAGACAGACACACAAACACACACAAATGTAACTGAAAAAGAAACCTCTATGGGGACATCGCCTTCTTGGAGAGGCTCTGTTGCCCCTCATCCTAGTGAACAAACAACTCCTACTCCCTGCCAGAGTATCCTACCCTTGGATTCAAAATGGTCTCAGAGGACACACCGGGTGGGCTCTGTCGCTGGGATCTTGCATAACCAATGCCCATAAGCCTGGCAAAGGTGGCGATGAGACGATAAGGTCAGGGACATGACCGCAGAAGAGGAGTGGGGACGCGATGAGTGGGAGGAGCTTCTAAATTATCCATCAGCACAAGCTGTCAGTGGCCCCAGCCATGAATAAATGTATAGGGGGAAAGGCAGGAGCCTTGGGGTCGAGGAAAACAGGTAGGGTATAAAAAGGGCACGCAAGGGACCAAGTCCAGCATCCTAGAGTCCAGATTCCAAACTGCTCAGAGTCCTGTGGACAGATCACTGCTTGGCAATGGCTACAGGTAAGCATGCGCAAATCCCGCTGGGTGTGGTTTGGGACCCAGGGCCCCTGAAGATGGATCTGAGGCTTCTAATGTGAGTGCGTTCCAACTTCTGCCATGTTGGGAATACTCTGGGTCCCTATGGGGATTGGGAGAGATCGGCCATTGCTCCCA
GGTTTCTCCTGCCCTCCTGTCTCTCTCTAGACTCTCGGACCTCCTGGCTCCTGACCGTCAGCCTGCTCTGCCTGCTCTGGCCTCAGGAGGCTAGTGCTTTTCCCGCCATGCCCTTGTCCAGTCTGTTTTCTAATGCTGTGCTCCGAGCCCAGCACCTGCACCAGCTGGCTGCTGACACCTACAAAGAGTTCGTAAGTTCCCCAGAGATGGGTGCCCGTTTGTGGAAGCAGGAAGGGGCAGGTCCTACCCCATACTCCTGGCCCCAGGGAAGGTCAATGGAGGGGAAATTATGGGGTAGGGGAATCTTAGCCAATGCTGTACCATAGTAATGATGGTGACGAGACACAAGCTGGTCCCTCAGTGACCACCCTTCTTCCAGGAGCGTGCCTACATTCCCGAGGGACAGCGCTATTCCATTCAGAATGCCCAGGCTGCTTTCTGCTTCTCAGAGACCATCCCGGCCCCCACAGGCAAGGAGGAGGCCCAGCAGAGAACCGTGAGTAGTCCCAGGCCTTGTCTGCACAAATCCTCGTTTCCCTCCATGCAGCCCTAACTGCACTCCAGGCCAGGGACCAGCTCCTCCCTGAAGCTGGGGTAACCTGGGAGTCCCAGGCAGAGGTCACTAGGCAATACACTAACCCCAGCCCTTTTTTTCCCCCCTCAGGACATGGAATTGCTTCGCTTCTCGCTGCTGCTCATCCAGTCATGGCTGGGGCCCGTGCAGTTCCTCAGCAGGATTTTCACCAACAGCCTGATGTTCGGCACCTCGGACCGTGTCTATGAGAAACTGAAGGACCTGGAAGAGGGCATCCAGGCTCTGATGCAGGTGAGGATGGACTAGCCTGGGGTTATGCCTGGAGCCTAGGTGGGGCTCACTGTCCTCTGTTTTACCGGTCAGCCCTTAGACCCTTGAGAAGGCTTCTTCTTCTTCATTTTCCTTTATGAAGCCTCCAGGCTTTTCCTTCGGTCCTGGGGTGGAGGGAGGCACAGCTCCCGAGTCTCCTGCCCTTCTTTCCCACGACAGGAGCTGGAAGATGGCAGCCCCCGTGTTGGGCAGATCCTCAAGCAAACCTATGACAAGTTTGACGCCAACATGCGCAGCGACGACGCGCTGCTCAAAAACTATGGGCTGCTCTCCTGCTTCAAGAAGGACCTGCACAAAGCGGAGACCTACCTGCGGGTCATGAAGTGTCGCCGCTTTGTGGAAAGCAGCTGTGCCTTCTAGCCACTCACCAGTGTCTCTGCTGCACTCTCCTGTGCCTCCCTGCCCCCTGGCAACTGCCACCCCGCGCTTTGTCCTAATAAAATTAAGATGCATCATATCACCCGGCTAGAGGTCTTTCTGTTATGGGATGGAGCAGTTGTGTCAATCTTGTTCCTGGAAGCCTGCGAGAA

Page 109: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequence alignment: why?

Early in the days of protein and gene sequence analysis, it was discovered that the sequences from related proteins or genes were similar, in the sense that one could align the sequences so that many corresponding residues match.

This discovery was very important: strong similarity between two genes is a strong argument for their homology. Bioinformatics is based on it.

Page 110: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequence alignment: why?

Terminology: Homology means that two (or more) sequences have a common ancestor. This is a statement about evolutionary history.

Similarity simply means that two sequences are similar, by some criterion. It does not refer to any historical process, just to a comparison of the sequences by some method. It is a logically weaker statement.

However, in bioinformatics these two terms are often confused and used interchangeably. The reason is probably that significant similarity is such a strong argument for homology.

Page 111: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

two protein alignment

Page 112: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

many genes have a common ancestor

The basis for comparison of proteins and genes using the similarity of their sequences is that the proteins or genes are related by evolution; they have a common ancestor.

Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently.

Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

Page 113: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

definition of sequence alignment

Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple alignment) sequences by searching for a series of individual characters or patterns that are in the same order in the sequences.

There are two types of alignment: local and global. In global alignment, an attempt is made to align the entire sequence. If two sequences have approximately the same length and are quite similar, they are suitable for the global alignment.

Local alignment concentrates on finding stretches of sequences with a high level of similarity.

Page 114: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

definition of sequence alignment

Global alignment

L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Local alignment

- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -

Page 115: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

interpretation of sequence alignment

Sequence alignment is useful for discovering structural, functional and evolutionary information.

Sequences that are very much alike may have similar secondary and 3D structure, similar function and likely a common ancestral sequence. It is extremely unlikely that such sequences obtained their similarity by chance. For DNA molecules with n nucleotides this probability is very low: $P = 4^{-n}$. For proteins the probability is even lower: $P = 20^{-n}$, where n is the number of amino acid residues.

Large scale genome studies revealed existence of horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species
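To give a feel for how small the chance-match probabilities on this slide are, here is a worked instance for n = 20. The choice n = 20 is an illustrative assumption, and the formulas assume that every residue is equally likely and independent of its neighbours.

```latex
\begin{aligned}
P_{\text{DNA}}     &= 4^{-20}  \approx 9.1 \times 10^{-13} \\
P_{\text{protein}} &= 20^{-20} \approx 9.5 \times 10^{-27}
\end{aligned}
```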

Page 116: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Dot matrix analysis
The dynamic programming (DP) algorithm
Word or k-tuple methods

methods of sequence alignment

Page 117: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

dot matrix analysis

A dot matrix analysis is a method for comparing two sequences to look for possible alignment (Gibbs and McIntyre 1970)

One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side

Starting from the first character in B, one moves across the page, keeping to the first row and placing a dot in every column where the character in A is the same

The process is continued until all possible comparisons between A and B are made

Any region of similarity is revealed by a diagonal row of dots

Isolated dots not on diagonal represent random matches

Page 118: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Detection of matching regions can be improved by filtering out random matches and this can be achieved by using a sliding window

This means that instead of comparing single sequence positions, several positions are compared at the same time, and a dot is printed only if a certain minimal number of matches occurs

Dot matrix analysis can also be used to find direct and inverted repeats within the sequences
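The dot matrix procedure and its sliding-window filter can be sketched in a few lines of Python. This is a minimal illustration, not the tool behind the plots in these slides: the function name `dot_matrix` and the parameters `window` and `min_matches` are hypothetical names chosen here.

```python
def dot_matrix(a, b, window=1, min_matches=1):
    """Build a text dot plot of sequence a (columns) against b (rows).

    A '*' is placed at (row j, column i) when at least `min_matches`
    of the `window` positions starting at a[i] and b[j] are identical.
    With window=1 and min_matches=1 this is the plain dot matrix.
    """
    rows = []
    for j in range(len(b) - window + 1):
        row = []
        for i in range(len(a) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


# Plain dot matrix: random single-base matches clutter the plot.
for line in dot_matrix("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"):
    print(line)

print()

# Sliding window of 3 requiring 3 matches: only diagonal runs survive.
for line in dot_matrix("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA",
                       window=3, min_matches=3):
    print(line)
```

Comparing the two printed grids shows the effect described above: isolated dots disappear with the window filter, while stretches of similarity remain as diagonal runs.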

dot matrix analysis

Page 119: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html

dot matrix analysis: two identical sequences

Page 120: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse

dot matrix analysis: two very different sequences

Page 121: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)

dot matrix analysis: two similar sequences

Page 122: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

dynamic programming

Page 123: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

back to the basics

DNA/RNA sequences: strings composed of an alphabet of 4 letters

Protein sequences: alphabet of 20 letters

Page 124: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

why do we do it?

Identify a gene
Find clues to gene function (ortholog?)
Find other organisms with this gene (homology)
Gather info for an evolutionary model
…

Page 125: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

alignment

Alignment is the basis for finding similarity
Pairwise alignment = dynamic programming
Multiple alignment: protein families and functional domains
Multiple alignment is "impossible" for lots of sequences
Another heuristic: progressive pairwise alignment

Page 126: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

an example

GCGCATGGATTGAGCGA

TGCGCCATTGATGACCA

possible alignment

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Page 127: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

alignment

Three elements:
Perfect matches
Mismatches
Insertions & deletions (indel)

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Page 128: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

significant similarity in an alignment

Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated)

Ha: the sequences diverge from a common ancestor (related)

Test statistic: Ymax = length of the longest running perfect match subsequence

Page 129: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

exact matching subsequences

In DNA alignment, the matching probability is

$p_{match} = p_a^2 + p_c^2 + p_g^2 + p_t^2$

Under Ho, the length Y of an exact-match subsequence should follow a geometric distribution
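The matching probability and the geometric tail can be checked numerically. The sketch below assumes independent positions and uses hypothetical helper names (`match_probability`, `prob_run_at_least`); the uniform base frequencies are an illustrative assumption, not data from the slides.

```python
def match_probability(freqs):
    """Probability that two independently drawn bases are identical:
    the sum of the squared base frequencies."""
    return sum(p * p for p in freqs.values())


def prob_run_at_least(k, p_match):
    """Under Ho, probability that a run of exact matches starting at a
    given position has length >= k. For a geometric run length this
    is simply p_match ** k."""
    return p_match ** k


# Illustrative assumption: all four bases equally frequent.
uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
p = match_probability(uniform)

for k in (5, 10, 20):
    print(k, prob_run_at_least(k, p))
```

Long exact-match runs become rapidly less likely under Ho, which is what makes Ymax usable as a test statistic.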

Page 130: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

well-matching subsequences

Evolution may cause small differences to even sequences with a reasonably recent common ancestor.

We consider Ymax to be the longest subseq with up to k mismatches.

Y follows a hyper-geometric distribution

P-value: exact/simulated/approximate (independence among Y does not hold any more)

Page 131: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

choosing alignments

There are many possible alignments

For example, compare:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?

Page 132: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Score: Used to determine quality of match and basis for the selection of matches. Scores are relative.

Expectation value: An estimate of the likelihood that a given hit is due to pure chance, given the size of the database; should be as low as possible. E.V.’s are absolute. A high score and a low E.V. indicate a true hit.

Sequence identity (%) (or Similarity): Number of matched residues divided by total length of probe

scoring

Page 133: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

scoring rule

Example Score = (# matches) – (# mismatches) – (# indels) x 2

Page 134: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

examples

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1x5) + (-1x6) + (-2x12) = -25
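The example scoring rule (+1 per match, -1 per mismatch, -2 per indel) is easy to apply mechanically, column by column. The helper below is a hypothetical sketch, not code from the course. Counting this way, the first alignment has 13 matches, 2 mismatches and 4 indels (score 3), and the second has 5 matches, 6 mismatches and 12 gap columns (score -25).

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Score a gapped pairwise alignment column by column.

    Returns (matches, mismatches, indels, total_score).
    A column containing '-' in either row counts as one indel.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = indels = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            indels += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + indels * indel
    return matches, mismatches, indels, total


print(score_alignment("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))
print(score_alignment("------GCGCATGGATTGAGCGA",
                      "TGCGCC----ATTGATGACCA--"))
```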

Page 135: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

edit distance

The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other

Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance (equivalently, maximizes the score):

$d(s_1, s_2) = \max_{\text{alignment of } s_1 \,\&\, s_2} \text{score}(\text{alignment})$

Page 136: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

computing edit distance

How can we compute the edit distance? If |s| = n and |t| = m, there are more than

$\binom{m+n}{m}$

alignments. For 2 sequences each of length 1000: > 10^600

The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently
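The size of this lower bound can be checked directly with exact integer arithmetic. The snippet below is a small sanity check, with `alignment_lower_bound` as a hypothetical helper name.

```python
import math


def alignment_lower_bound(n, m):
    """Binomial lower bound on the number of alignments of two
    sequences of lengths n and m: (m + n) choose m."""
    return math.comb(n + m, m)


count = alignment_lower_bound(1000, 1000)
print(count > 10 ** 600)    # the bound quoted on the slide
print(len(str(count)))      # number of decimal digits in the bound
```

Enumerating that many candidates is out of the question, which motivates the dynamic programming approach.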

Page 137: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

recursive argument

Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases:
1. Last position is (s[n+1], t[m+1])
2. Last position is (s[n+1], -)
3. Last position is (-, t[m+1])

Case 1, where σ(a, b) is the score of one alignment column:

$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$

Page 138: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases:
1. Last position is (s[n+1], t[m+1])
2. Last position is (s[n+1], -)
3. Last position is (-, t[m+1])

Case 2, where σ(a, b) is the score of one alignment column:

$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$

recursive argument

Page 139: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases:
1. Last position is (s[n+1], t[m+1])
2. Last position is (s[n+1], -)
3. Last position is (-, t[m+1])

Case 3, where σ(a, b) is the score of one alignment column:

$d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$

recursive argument

Page 140: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

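Define the notation:

$V[i, j] = d(s[1..i], t[1..j])$

Using the recursive argument, we get the following recurrence for V, where σ(a, b) is the score of one alignment column:

$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$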

recursive argument

Page 141: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

recursive argument

Of course, we also need to handle the base cases in the recursion, where σ(a, b) is the score of one alignment column:

$V[0, 0] = 0$

$V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -)$

$V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$

Page 142: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

We fill the matrix using the recurrence rule

            A    G    C
       0    1    2    3
  0
A 1
A 2
A 3
C 4

dynamic programming algorithm

Page 143: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

            A    G    C
       0    1    2    3
  0    0   -2   -4   -6
A 1   -2    1   -1   -3
A 2   -4   -1    0   -2
A 3   -6   -3   -2   -1
C 4   -8   -5   -4   -1

Conclusion: d(AAAC,AGC) = -1

dynamic programming algorithm
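The matrix-filling step can be sketched as below. This is a minimal illustration, assuming the example scoring rule from the earlier slide (match +1, mismatch -1, indel -2); the function names `sigma` and `fill_matrix` are chosen here and are not from the course. With these assumptions the code reproduces the matrix for AAAC against AGC.

```python
def sigma(a, b, match=1, mismatch=-1, indel=-2):
    """Score of one alignment column; '-' denotes a gap.
    The default values are the example scoring rule (an assumption)."""
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch


def fill_matrix(s, t):
    """Fill V so that V[i][j] is the best score for aligning
    s[:i] with t[:j]. Python strings are 0-based, so s[i] here
    plays the role of s[i+1] in the slide notation."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against nothing is all gaps.
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    # Recurrence: best of diagonal, up (gap in t), left (gap in s).
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),
                V[i][j + 1] + sigma(s[i], "-"),
                V[i + 1][j] + sigma("-", t[j]),
            )
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
print("d(AAAC, AGC) =", V[-1][-1])
```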

Page 144: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

interpretation of pointers

Insertion of S2(j) into S1

Deletion of S1(i) from S1

Match or Substitution

Page 145: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

We now trace back the path that corresponds to the best alignment

            A    G    C
       0    1    2    3
  0    0   -2   -4   -6
A 1   -2    1   -1   -3
A 2   -4   -1    0   -2
A 3   -6   -3   -2   -1
C 4   -8   -5   -4   -1

AAAC
AG-C

reconstructing the best alignment

Page 146: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

reconstructing the best alignment

Sometimes, more than one alignment has the best score

            A    G    C
       0    1    2    3
  0    0   -2   -4   -6
A 1   -2    1   -1   -3
A 2   -4   -1    0   -2
A 3   -6   -3   -2   -1
C 4   -8   -5   -4   -1

AAAC
A-GC
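The traceback can be made concrete by following every pointer that is consistent with the recurrence, which lists all alignments that share the best score. The sketch assumes the same example scoring (match +1, mismatch -1, indel -2) and uses hypothetical function names. Under these assumptions the enumeration for AAAC against AGC returns both alignments shown on the slides and also a third with the same score, AAAC over -AGC.

```python
def sigma(a, b, match=1, mismatch=-1, indel=-2):
    """Score of one alignment column; '-' denotes a gap (assumed values)."""
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch


def fill_matrix(s, t):
    """V[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(V[i][j] + sigma(s[i], t[j]),
                                  V[i][j + 1] + sigma(s[i], "-"),
                                  V[i + 1][j] + sigma("-", t[j]))
    return V


def optimal_alignments(s, t):
    """Return every alignment whose score equals V[len(s)][len(t)].
    Note: the number of co-optimal alignments can grow very quickly
    for long sequences, so this is only suitable for small examples."""
    V = fill_matrix(s, t)
    results = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        # Diagonal pointer: match or substitution.
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, s[i - 1] + top, t[j - 1] + bottom)
        # Up pointer: s[i-1] aligned against a gap.
        if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            walk(i - 1, j, s[i - 1] + top, "-" + bottom)
        # Left pointer: t[j-1] aligned against a gap.
        if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
            walk(i, j - 1, "-" + top, t[j - 1] + bottom)

    walk(len(s), len(t), "", "")
    return results


for top, bottom in optimal_alignments("AAAC", "AGC"):
    print(top)
    print(bottom)
    print()
```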

Page 147: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

complexity

Space: O(mn)
Time: O(mn)
Filling the matrix: O(mn)
Backtrace: O(m+n)

Page 148: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

other scoring schemes

Needleman and Wunsch: 1 for identical amino acid, 0 otherwise

Dayhoff PAM scoring matrix: variations include BLOSUM matrices (Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919).

… Different Gap Cost Function

Page 149: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

scoring matrix for protein sequences

Page 150: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

substitution “log odds” matrix BLOSUM 62

Henikoff and Henikoff (1992; PNAS 89:10915-10919)

Page 151: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

( M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 5).

PAM 250 matrix

Page 152: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

multiple sequence alignment

Often a probe sequence will yield many hits in a search. Then we want to know which are the residues and positions that are common to all or most of the probe and match sequences

In multiple sequence alignment, all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the 'same' position in each sequence.

Page 153: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

an example: cellulose-binding domain of cellobiohydrolase

[Figure: multiple sequence alignment. Callouts: name of homologous domains; position of residue; residues and positions common to most homologs (consensus).]

Page 154: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

why multiple sequence alignment?

Identify consensus segments
Hence the most conserved sites and residues

Use for construction of phylogenies
Convert similarity to distance
Of genes, strains, organisms, species, life

www.ch.embnet.org/software/ClustalW.html

Page 155: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sequence logo

This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position.
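The information content of a column can be computed as the maximum possible entropy minus the observed entropy. The sketch below is a simplified illustration for ungapped DNA columns: the function name `column_information` is hypothetical, and the small-sample correction that logo tools usually apply is left out.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column:
    log2(alphabet size) minus the Shannon entropy of the observed
    residue frequencies. Assumes no gaps and no sample correction."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Toy alignment (made-up sequences for illustration only).
alignment = ["ACGT", "ACGA", "ACTT", "ACCG"]
for position, column in enumerate(zip(*alignment), start=1):
    print(position, "".join(column), round(column_information(column), 3))
```

A fully conserved DNA column scores 2 bits and is drawn at full height, while a column with all four bases equally frequent scores 0 bits. For protein columns, `alphabet_size=20` would be the corresponding setting.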

Page 156: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

sample multiple alignment

Page 157: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

constructing the tree of life with k-mers (16s RNA, 35 organisms)

[Figure: trees with labels Eukarya, Bacteria, A. aeolicus, T. maritima, Archaea.]

Black tree: distribution of 8-mers. Red tree: sequence alignment.

Page 158: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

databases of multiple alignments

Pfam: Protein families database of alignments and HMMs. www.cgr.ki.se

PRINTS: multiple motifs consisting of ungapped, aligned segments of sequences, which serve as fingerprints for a protein family. www.bioinf.man.ac.uk

BLOCKS: multiple motifs of ungapped, locally aligned segments created automatically. fhcrc.org

Page 159: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

software

Page 160: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

manual alignment- software

GDE: The Genetic Data Environment (UNIX)

CINEMA: Java applet, available from: http://www.biochem.ucl.ac.uk

Seqapp/Seqpup: Mac/PC/UNIX, available from: http://iubio.bio.indiana.edu

SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

Page 161: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Search a sequence database for fragments similar to the query sequence

1. Compile a list of high-scoring short words shared by the query sequence and the database;
2. Scan the database for “hits”
3. Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)

BLAST
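The three steps can be illustrated with a toy seed-and-extend search. This is a simplified sketch and not the NCBI BLAST implementation. It uses exact word matches as seeds instead of a scored word list, extends to the ends of the sequences instead of stopping early, and the function names and scoring values are assumptions made here.

```python
def seed_hits(query, subject, w=3):
    """Steps 1 and 2 (simplified): index the words of length w in
    the query, then scan the subject for exact occurrences.
    Returns a list of (query_pos, subject_pos) hits."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend(query, subject, i, j, w, match=1, mismatch=-1):
    """Step 3 (simplified): ungapped extension of a seed in both
    directions, keeping the best-scoring segment pair that contains
    the seed. Returns (score, query_start, subject_start, length)."""
    best = score = w * match
    best_right = k = 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
    score = best
    best_left = k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
    return best, i - best_left, j - best_left, w + best_left + best_right


query, subject = "ACGTTGCA", "TTACGTTGCATT"
hits = seed_hits(query, subject, w=3)
best = max(extend(query, subject, i, j, 3) for i, j in hits)
print(best)
```

The word index is what makes the scan fast: only positions that share a short word with the query are ever extended.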

Page 162: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

BLAST

Altschul et al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Variations of BLAST designed for specific purposes http://www.ncbi.nlm.nih.gov/BLAST/

Page 166: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

similarity searching the databanks

What is similar to my sequence?

Searching gets harder as the databases get bigger, and quality degrades

Tools: BLAST and FASTA = time-saving heuristics (approximate)

Statistics + informed judgement of the biologist

Page 167: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

read out

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369

Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
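The "Identities" and "Gaps" figures in such a readout can be recomputed from the aligned rows. The sketch below uses the first Query/Sbjct block of the readout above; the function names are illustrative and are not part of any BLAST tool.

```python
def match_line(query, sbjct):
    """Build the '|' line BLAST prints between two aligned rows."""
    return "".join("|" if q == s and q != "-" else " "
                   for q, s in zip(query, sbjct))


def identity(query, sbjct):
    """Return (identical positions, gap characters, alignment length)."""
    same = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = query.count("-") + sbjct.count("-")
    return same, gaps, len(query)


# First block of the readout (query 17-76, subject 1-59).
query_block = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
sbjct_block = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

same, gaps, length = identity(query_block, sbjct_block)
print(f"Identities = {same}/{length} ({100 * same // length}%), Gaps = {gaps}/{length}")
print(match_line(query_block, sbjct_block))
```

Summing the per-block counts over all five blocks gives the totals reported in the header line.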

Page 168: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

structure - function relationships

Can we predict the function of protein molecules from their sequence?

sequence > structure > function

Conserved functional domains = motifs

Prediction of some simple 3-D structures (α-helix, β-sheet, membrane-spanning, etc.)

Page 169: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

protein domains

(from ProDom database)

Page 170: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

DNA sequencing

Automated sequencers: > 40 kb per day

500 bp reads must be assembled into complete genes
errors, especially insertions and deletions
error rate is highest at the ends, where we want to overlap the reads
vector sequences must be removed from the ends

Faster sequencing relies on better software
overlapping deletions vs. shotgun approaches: TIGR
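As a rough illustration of why read ends matter for assembly, here is a greedy overlap-merge sketch. It assumes error-free reads on one strand and exact suffix/prefix overlaps, which real assemblers cannot assume; the function names and the minimum overlap length are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because the merge depends entirely on the last few bases of one read matching the first few of another, sequencing errors concentrated at read ends directly break the overlaps.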

Page 171: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

DNA sequencing

Page 172: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

finding genes in genome sequence is not easy

About 2% of human DNA encodes functional genes.

Genes are interspersed among long stretches of non-coding DNA.

Repeats, pseudogenes, and introns confound matters.

Page 173: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

pattern finding tools

It is possible to use DNA sequence patterns to predict genes:
promoters
translational start and stop codons (ORFs)
intron splice sites
codon bias

Can also use similarity to known genes/ESTs
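The simplest of these patterns, an open reading frame from a start codon to the next in-frame stop codon, can be found with a short scan. This sketch looks at the forward strand only and ignores introns, so it is closer to prokaryotic gene finding; the minimum length is an assumed parameter.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for ATG...stop ORFs; end is exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

A fuller gene finder would also scan the reverse complement and weigh the other signals listed above (promoters, splice sites, codon bias).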

Page 174: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

phylogenetics

Evolution = mutation of DNA (and protein) sequences

Can we define evolutionary relationships between organisms by comparing DNA sequences?

is there one molecular clock?
phenetic vs. cladistic approaches
lots of methods and software; what is the "correct" analysis?

Page 175: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

phylogenetics

Page 176: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

software tools on the web

Many of the best tools are free over the Web: BLAST, ENTREZ/PUBMED, protein motif databases

Bioinformatics “service providers”: DoubleTwist™, Celera, BioNavigator™

Hodgepodge collection of other tools: PCR primer design, pairwise and multiple alignment

Page 177: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

PC programs

Macintosh and Windows applications

Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™

Freeware: Phylip, Fasta, Clustal, etc.

Better graphics, easier to use

Can't access very large databases or perform demanding calculations

Integration with web databases and computing services

Page 178: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Vector NTI

Page 179: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

most important sequence databases

GenBank: maintained by the US National Center for Biotechnology Information (NCBI)

All biological sequences: www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

Genomes: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome

Swiss-Prot: maintained by the EMBL European Bioinformatics Institute (EBI)

Protein sequences: www.ebi.ac.uk/swissprot/

Page 180: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

genome project

Page 181: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

the human genome project

The genome sequence is complete - almost! Approximately 3.2 billion base pairs.

Page 182: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Any human gene can now be found in the genome by similarity searching with over 99% certainty.

However, the sequence still has many gaps:
hard to find an uninterrupted genomic segment for any gene
still can’t identify pseudogenes with certainty

This will improve as more sequence data accumulates

all the genes

Page 183: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Raw Genome Data: example of the code

Page 184: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!

example of the code

Page 185: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

inconsistency

Celera says that there are only ~34,000 genes, so why are there ~60,000 human genes on Affymetrix GeneChips?

Why does GenBank have 49,000 human gene coding sequences and UniGene have 96,000 clusters of unique human ESTs?

Clearly we are in desperate need of a theoretical framework to go with all of this data

http://www.celera.com/

Page 186: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

implications for biomedicine

Physicians will use genetic information to diagnose and treat disease.

Virtually all medical conditions have a genetic component.

Faster drug development research

Individualized drugs

Gene therapy

All biologists will use gene sequence information in their daily work

Page 187: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

the equipment

Page 188: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

meaning of the code …

Page 189: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

meaning of the code …

Page 190: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

evolution

Page 191: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

how do genomes evolve?

Point mutations
Rearrangements
Recombination
Selection and Drift

Page 192: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

how can you view the evolution?

Individual gene alignment view (usually proteins)
Dot plot or VISTA (local similarity) view
Synteny view
Composite (average) views

Page 193: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

DNA dot plot view

Show one DNA along X-axis, second on Y-axis
For every position along both, score local similarity
Display 2-D plot of similarity in gray-scale
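The recipe can be sketched as follows. The window-based match count used as the "local similarity" score, and the four text characters standing in for gray levels, are assumptions made for illustration.

```python
def dot_plot(x, y, window=3):
    """scores[j][i] = matches between x[i:i+window] and y[j:j+window]."""
    cols = len(x) - window + 1
    rows = len(y) - window + 1
    return [[sum(x[i + k] == y[j + k] for k in range(window))
             for i in range(cols)]
            for j in range(rows)]


def render(scores, window=3):
    """Map each score to a character; denser characters mean higher similarity."""
    shades = " .:#"
    return "\n".join(
        "".join(shades[s * (len(shades) - 1) // window] for s in row)
        for row in scores)


print(render(dot_plot("ACGTACGTTT", "ACGTACGTTT", window=3)))
```

Plotting a sequence against itself gives the main diagonal (the self-match); repeats such as a tandem duplication show up as additional lines parallel to it.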

Page 194: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

self-match

tandem duplication

dot plot example

Page 195: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

random dot plot

Page 196: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

promoter conservation

gene structure revealed by dot plot

Page 197: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

synteny view

Synteny definition: a contiguous region in another genome that has more-or-less the same genes in the same order.

The boundaries of what constitutes synteny are a bit fuzzy: for example, you would probably still call a region syntenic if it is missing one gene out of many.
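One way to make this fuzzy definition concrete is to walk along the gene order of genome A and group genes whose positions in genome B keep increasing with only small jumps. The sketch below is a hypothetical simplification: it handles the forward orientation only (inversions are not detected), it silently skips any number of genes absent from B, and `max_skip` and `min_genes` are assumed thresholds.

```python
def syntenic_blocks(genes_a, genes_b, max_skip=1, min_genes=3):
    """Group genes of A that appear in the same order, nearly adjacent, in B."""
    pos_b = {gene: j for j, gene in enumerate(genes_b)}
    blocks, current, last_j = [], [], None
    for gene in genes_a:
        j = pos_b.get(gene)
        if j is None:
            continue  # gene missing from B: tolerated, does not break the block
        if last_j is not None and 0 < j - last_j <= max_skip + 1:
            current.append(gene)
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene]
        last_j = j
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
b = ["g1", "g2", "g4", "g5", "x", "g7", "g6"]
print(syntenic_blocks(a, b))
```

Here g3 is missing from the second genome, yet g1, g2, g4 and g5 are still reported as one block, which matches the "missing one gene out of many" case.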

Page 198: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

single inversion

Page 199: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

insertion or deletion

Page 200: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

double inversion

Page 201: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

intra-chromosomal rearrangements

Page 202: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

inter-chromosomal rearrangements

Page 203: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

syntenic scaling

These regions are perfectly syntenic, but on average the mouse has shorter regions separating alignable conserved blocks.

Page 204: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

limitations to synteny view

Provides only an overview of arrangement, with no information about the degree or areas of conservation.

As genomes become more distant, synteny becomes more chaotic, until (in the extreme) most blocks are one gene long (e.g. flies vs. human).

In some cases, very deep synteny can be seen, most dramatically in the Hox clusters.

Page 205: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

Hox cluster

Page 206: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

composite or summary views

View comparative summaries that encapsulate general properties of the genome.

For example, G-C content comparison:
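A minimal sketch of how a G-C content summary could be computed, both for a whole sequence and in fixed windows along it. The window and step sizes are arbitrary assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_windows(seq, window=100, step=100):
    """G-C fraction in successive windows along the sequence."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


print(gc_content("GGCCAATT"))
print(gc_windows("GGGGAAAA", window=4, step=4))
```

Running the same windows over two genomes and comparing the resulting distributions gives one kind of composite view.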

Page 207: Bioinformatics.  introduction  molecular biology  biotechnology  bioMEMS  bioinformatics  bio-modeling  cells and e-cells  transcription and regulation.

phylogenetics and evolution