BIOINFORMATICS Introduction

23
1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu BIOINFORMATICS Introduction Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a

description

BIOINFORMATICS Introduction. Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a. Biological Data. Computer Calculations. +. Bioinformatics. What is Bioinformatics?. - PowerPoint PPT Presentation

Transcript of BIOINFORMATICS Introduction

Page 1: BIOINFORMATICS Introduction

1

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

BIOINFORMATICSIntroduction

Mark Gerstein, Yale Universitybioinfo.mbb.yale.edu/mbb452a

Page 2: BIOINFORMATICS Introduction

2

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Bioinformatics

BiologicalData

ComputerCalculations

+

Page 3: BIOINFORMATICS Introduction

3

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

What is Bioinformatics?

• (Molecular) Bio - informaticsBioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Bioinformatics is a practical discipline with many applications.

Page 4: BIOINFORMATICS Introduction

4

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

What is the Information?Molecular Biology as an Information Science

• Central Dogmaof Molecular Biology DNA -> RNA -> Protein -> Phenotype -> DNA

• Molecules Sequence, Structure, Function

• Processes Mechanism, Specificity, Regulation

• Central Paradigmfor Bioinformatics

Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype

• Large Amounts of Information Standardized Statistical

•Genetic material•Information transfer (mRNA)•Protein synthesis (tRNA/mRNA)•Some catalytic activity

•Most cellular functions are performed or facilitated by proteins.

•Primary biocatalyst

•Cofactor transport/storage

•Mechanical motion/support

•Immune protection

•Control of growth/differentiation

(idea from D Brutlag, Stanford, graphics from S Strobel)

Page 5: BIOINFORMATICS Introduction

5

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Molecular Biology Information - DNA

• Raw DNA Sequence Coding or Not? Parse into genes?

4 bases: AGCT ~1 K in a gene, ~2 M in genome

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Page 6: BIOINFORMATICS Introduction

6

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Molecular Biology Information: Protein Sequence

• 20 letter alphabet ACDEFGHIKLMNPQRSTVWY but not BJOUX Z

• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

• ~200 K known protein sequences

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKAd3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKAd3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

Page 7: BIOINFORMATICS Introduction

7

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Molecular Biology Information:Macromolecular Structure

• DNA/RNA/Protein Almost all protein

(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Page 8: BIOINFORMATICS Introduction

8

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Molecular Biology Information:

Whole Genomes• The Revolution Driving Everything

Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F.,

Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A.,

Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-

genome random sequencing and assembly of Haemophilus influenzae rd."

Science 269: 496-512.

(Picture adapted from TIGR website, http://www.tigr.org)

• Integrative Data1995, HI (bacteria): 1.6 Mb & 1600 genes done

1997, yeast: 13 Mb & ~6000 genes for yeast

1998, worm: ~100Mb with 19 K genes

1999: >30 completed genomes!

2003, human: 3 Gb & 100 K genes...

Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Pekso, Nature 401: 115-116 (1999)

Page 9: BIOINFORMATICS Introduction

9

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Gene Expression Datasets: the Transcriptome

Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Young/Lander, Chips,Abs. Exp.

Brown, Botstein array,

Rel. Exp. over Timecourse

Snyder, Transposons, Protein Exp.

Page 10: BIOINFORMATICS Introduction

10

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Array Data

(courtesy of J Hager)

Yeast Expression Data in Academia: levels for all 6000 genes!

Can only sequence genome once but can do an infinite variety of these array experiments

at 10 time points, 6000 x 10 = 60K floats

telling signal from background

Page 11: BIOINFORMATICS Introduction

11

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

microarrays

• Affymetrix• Oligos

• Glass slides Pat brown

Page 12: BIOINFORMATICS Introduction

12

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Other Whole-Genome

Experiments

Systematic Knockouts

Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6

2 hybrids, linkage maps

Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52

For yeast: 6000 x 6000 / 2 ~ 18M interactions

Page 13: BIOINFORMATICS Introduction

13

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Molecular Biology Information:Other Integrative Data

• Information to understand genomes Metabolic Pathways

(glycolysis), traditional biochemistry

Regulatory Networks Whole Organisms

Phylogeny, traditional zoology

Environments, Habitats, ecology

The Literature (MEDLINE)

• The Future....

(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Page 14: BIOINFORMATICS Introduction

14

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Large-scale Information:GenBank Growth

Year Base Pairs Sequences1982 680338 6061983 2274029 24271984 3368765 41751985 5204420 57001986 9615371 99781987 15514776 145841988 23800000 205791989 34762585 287911990 49179285 395331991 71947426 556271992 101008486 786081993 157152442 1434921994 217102462 2152731995 384939485 5556941996 651972984 10212111997 1160300687 17658471998 2008761784 28378971999 3841163011 48645702000 8604221980 7077491

GenBank Data

Page 15: BIOINFORMATICS Introduction

15

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Large-scale Information:Exponential Growth of Data Matched by Development of Computer Technology

• CPU vs Disk & Net As important as the

increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial

• Driving Force in Bioinformatics

(Internet picture adaptedfrom D Brutlag, Stanford)

Structures in PDB0

50010001500200025003000350040004500

1980 1985 1990 19950

20

40

60

80

100

120

1401979 1981 1983 1985 1987 1989 1991 1993 1995

CPU Instruction

Time (ns)

Num.Protein DomainStructures

Internet Hosts

Page 16: BIOINFORMATICS Introduction

16

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

The Next Stepafter the sequence:

Pseudogenes, Regulatory Regions, Repeats

Step 1: The genome

sequence and genes

ProteomicsExpression

AnalysisStructural Genomics,

Protein Interactions

Comprehensive Understanding of Gene Function on a Genomic Scale

Evolutionary Implications of Intergenic Regions as Gene Graveyard

Page 17: BIOINFORMATICS Introduction

17

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

The next

step:

Page 18: BIOINFORMATICS Introduction

18

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Organizing Molecular Biology

Information:Redundancy and

Multiplicity• Different Sequences Have the Same

Structure• Organism has many similar genes• Single Gene May Have Multiple Functions• Genes are grouped into Pathways• Genomic Sequence Redundancy due to the

Genetic Code• How do we find the similarities?.....

(idea from D Brutlag, Stanford)

Integrative Genomics - genes structures functions pathways expression levels regulatory systems ….

Page 19: BIOINFORMATICS Introduction

19

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Integrative Genomic Surveys of Many Proteins from Many Perspectives

“Prediction” Bioinformatics

(focused on individual genes and structures)

vs

Page 20: BIOINFORMATICS Introduction

20

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Major Application I:Designing Drugs

• Understanding How Structures Bind Other Molecules (Function)

• Designing Inhibitors

• Docking, Structure Modeling

(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).

Page 21: BIOINFORMATICS Introduction

21

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

FoldRecognition

GenePrediction

SequenceAlignment

Expression Clustering

SecondaryStructurePrediction

Docking &Drug Design

ProteinFlexibility

GenomeAnnotation

FunctionClass-

ification

StructureClassification

ProteinGeometry

Bioinformatics Subtopics

HomologyModeling

DatabaseDesign

E-literature

Large-ScaleGenomicSurveys

Page 22: BIOINFORMATICS Introduction

22

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

What is Bioinformatics?

• (Molecular) Bio - informaticsBioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Bioinformatics is a practical discipline with many applications.

Page 23: BIOINFORMATICS Introduction