[email protected] bork.embl-heidelberg.de
[email protected] http://www.bork.embl-heidelberg.de/
Peer Bork
EMBL & MDC, Heidelberg & Berlin
Proteome analysis in silico
‘omes: use and misuse
Original intention, exemplified by the genome:
‘ome – the entirety of biomolecular objects (ALL genes etc.)
‘omics – research on an entirety of biomolecular objects
Proteomics – research on the entirety of proteins (so far in an organism); term coined at the beginning of the 1990s
Common practice:
‘omics - used to describe large-scale approaches (whereby large is sometimes 1)
Proteomics - used for research on many proteins (whereby many might mean 3)
Protein profiling and interaction proteomics
Originally two main directions:
Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms).
Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins
Bioinformatics analysis is essential in both areas
Proteome analysis in silico
Part I: Protein detection and annotation by homology and orthology (function in 1D)
Part II: Protein interactions and protein networks (function in 2D)
Temporal and spatial considerations (function in 3D+4D)
Alternative splicing
Genome annotation
Bork et al. JMolBiol 1998
Domain analysis
Protein networks
Literature mining coupled to genomic data
70% prediction accuracy is great!
Prediction of | acc*cov | %acc | %cov of reference set | reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% ‘high confidence’ in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
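The acc*cov column in the table above appears to be the product of accuracy and coverage. A minimal sketch of that arithmetic for three of the rows follows; the function name `combined_score` and the `rows` dictionary are illustrative names, not taken from the slides.

```python
def combined_score(accuracy: float, coverage: float) -> float:
    """Product of accuracy and coverage, both given as fractions in [0, 1]."""
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("accuracy and coverage must be fractions between 0 and 1")
    return accuracy * coverage


# (accuracy, coverage) pairs copied from three rows of the table.
rows = {
    "Human promoters": (0.50, 0.70),
    "Human regulatory RNA elements": (0.85, 0.40),
    "Signal peptides (only presence)": (0.90, 1.00),
}

# Round to two decimals, as in the acc*cov column.
scores = {name: round(combined_score(a, c), 2) for name, (a, c) in rows.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A method that is 90% accurate but applies to only half of the reference set therefore scores lower than one that is 70% accurate on 70% of it, which is the point the slide title makes.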
Concepts in function predictionConcepts in function predictionHomology-basedHomology-based (intrinsic molecular features)(intrinsic molecular features)
Gene context Gene context (functional associations)(functional associations)
- Sequence and domain DBs (Blast, Pfam,Smart)- Sequence and domain DBs (Blast, Pfam,Smart)
- Gene neighbourhood, fusion, co-occurrence- Gene neighbourhood, fusion, co-occurrence- Shared regulatory elements- Shared regulatory elements
Other Other (residue level, functional class )(residue level, functional class )- Correlated mutations- Correlated mutations- Interaction threading- Interaction threading
- Function transfer by orthology- Function transfer by orthology
- Feature analysis- Feature analysis
I. Homology-based protein annotation
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Metazoan genome annotation: the dark side…
Homology detection and domain annotation
Status of homology-based function prediction
Many homologues, an increasing number of predictable folds, but tough times for automatic function prediction
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence
Henikoff et al. 1997 Science 278, 609
History of signaling domain discovery
[Figure: bar chart of cytoplasmic domains and nuclear domains by period of discovery (<1985, 85/86, 87/88, 89/90, 91/92, 93/94, 95/96, 97/98, 99/00, 01/02, 03/now); y-axis 0 to 40]
Systematic discovery by
1) searching ‘in between’ regions
2) starting with repeats
Doerks et al. 2002 Genome Res.
Ponting et al. 2001 Genome Res.
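Searching ‘in between’ regions can be pictured as collecting the stretches of a protein that no known domain covers and treating the longer ones as candidates for new domains. The sketch below illustrates only that bookkeeping step; the function name, the 1-based inclusive coordinates and the `min_length` cutoff of 50 residues are assumptions for illustration, not parameters from the cited papers.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def in_between_regions(
    protein_length: int,
    annotated: List[Interval],
    min_length: int = 50,  # hypothetical cutoff for a region worth searching
) -> List[Interval]:
    """Return unannotated stretches of at least min_length residues."""
    regions: List[Interval] = []
    cursor = 1  # first residue not yet covered by an annotated domain
    for start, end in sorted(annotated):
        if start - cursor >= min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if protein_length - cursor + 1 >= min_length:
        regions.append((cursor, protein_length))
    return regions


# Toy protein of 600 residues with two annotated domains.
candidates = in_between_regions(600, [(100, 250), (300, 420)])
print(candidates)
```

In this toy case the 49-residue linker between the two domains falls below the cutoff, so only the N-terminal and C-terminal stretches are reported.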
Domain discovery in disease genes
gene/protein | disease | domains | reference
dystrophin | Muscular dystrophy | WW | Bork & Sudol: TIBS 19 (94) 531
X11 | Friedreich's ataxia (c) | PI/PTB+PDZ | Bork & Margolis: Cell 80 (95) 693
PKD1 | Polycystic kidney | many (PKD1) | Int. PKD1 consortium: Cell 81 (95) 298
HD | Huntington's | HEAT repeats | Andrade & Bork: Nat. Genet. 11 (95) 115
BRCA2 | Breast cancer | BRC repeats | Bork et al.: Nat. Genet. 13 (96) 22
BRCA1 | Breast cancer | BRCT | Koonin et al.: Nat. Genet. 13 (96) 266
dsh | DiGeorge syndrome | DEP | Ponting & Bork: TIBS 21 (96) 245
X25 (FRDA) | Friedreich's ataxia | CyaY | Gibson et al.: TINS 19 (96) 465
beige/CH | Chediak-Higashi | BEACH | Nagle et al.: Nat. Genet. 14 (96) 307
RB | Retinoblastoma | BRCT | Bork et al.: FASEB J. 11 (97) 68
9 incl. HML1 | Colon cancer | HSP90 | Mushegian et al.: PNAS 94 (97) 5831
TSG101 | Breast cancer | UBC | Ponting, Cai & Bork: JMM 75 (97) 467
WRN/BLM | Werner + Bloom syn. | HRDC | Morozov et al.: TIBS 22 (97) 417
2 incl. pyrin | Mediterranean fever | SPRY | Schultz et al.: PNAS 95 (98) 5857
p73 | various tumors? | SAM | Bork & Koonin: Nat. Genet. 18 (98) 313
mahogany | Obesity | PSI | Nagle et al.: Nature 398 (99) 148
Parkin | AP-J Parkinsonism | IBR | Morett & Bork: TIBS 24 (99) 229
SMART: Blast-like input
- Access to different databases
- Domain annotation & architecture
- Alerting
www.smart.embl-heidelberg.de
Collaboration with Chris Ponting
SMART: Digested output
- signal sequence, coiled coil and TM
- Pfam integrated
- comparison of domain context
A putative transport-associated microtubule-binding domain
[Figure: domain architectures of proteins containing the MIT domain; labels: Spartin, Mutation, Plant-related]
• Calpain 7
• Spastin
• SKD1 protein
• VPS4p ATPase (Vacuolar protein sorting factor 4A and 4B)
• Tobacco mosaic virus helicase domain-binding protein
• Sorting nexin 15
• RSK-like protein
• Similar to ribosomal protein S6 kinase
• CG8866
Unifying disorders associated with hereditary spastic paraplegia?
Ciccarelli, F. D., et al. Genomics 81 (03) 437
Patel, H. et al. Nat Genet 31 (02) 347
I. Homology-based genome annotation
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Metazoan genome annotation: the dark side…
Homology detection and domain annotation
Number of human genes in time
[Figure: estimated number of human genes in thousands (y-axis 0 to 120) at Feb00, Aug00, Oct00, Dec00, Feb01 and Apr01; series labels: HGS, Incyte and co; Textbooks, public opinion; Celera; HGP; others; plotted values 38, 32, 52, 39, 27, 24, 22; annotation: Basis for Feb 01 publications; index charts (2T to 10T, to Jan05): NEMAX50 index, TecDAX index]
Improvement of gene cluster predictions
Mouse chr4:94-94.6 Mb p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments
Known genes: cyp2j6, cyp2j9, cyp2j5, cyp2j13
ESTs
Twinscan (1 gene)
GeneID (3 genes)
fgenesh++ (13 genes)
ENSEMBL (9 genes)
Manual (8 genes)
(comparison performed in 2004)
BLAST2GENE finds independent gene copies
BLAST of cyp2j13 protein vs. Mouse chr4:94-94.6 Mb: ~150 alignments
[Figure: BLAST2GENE output for query Mm.cyp2j.pep (len=501); query axis 100 to 400; GENE_7 shown again enlarged]
gene copy | region | cov | id%
GENE_1 | 4764..13967 | 0.441 | 60.4
GENE_2 | 35993..77274 | 0.972 | 66.5
GENE_3 | 87921..106913 | 0.166 | 59.4
GENE_4 | 126547..126708 | 0.108 | 68.0
GENE_5 | 131666..172308 | 0.980 | 63.2
GENE_6 | 181333..220995 | 0.441 | 50.2
GENE_7 | 241542..295291 | 0.978 | 63.4
GENE_8 | 302976..378293 | 0.451 | 53.5
GENE_9 | 391323..454110 | 0.992 | 59.9
GENE_10 | 462789..462893 | 0.070 | 57.0
GENE_11 | 464757..500175 | 0.996 | 67.2
GENE_12 | 515451..538069 | 0.986 | 61.0
GENE_13 | 552820..562733 | 0.184 | 62.7
GENE_14 | 576195..588175 | 0.547 | 87.8
Hundreds of often considerable differences to current gene prediction pipelines!
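One way to read the BLAST2GENE output above is to separate candidate copies that cover nearly the whole query protein from partial ones. The sketch below does only that, using the cov and id% values shown for the 14 candidates; the 0.9 coverage cutoff and the names `GeneCopy` and `split_by_coverage` are assumptions for illustration and are not BLAST2GENE's own criteria, nor do they reproduce the manual count of 8 genes.

```python
from typing import List, NamedTuple, Tuple


class GeneCopy(NamedTuple):
    name: str
    start: int
    end: int
    coverage: float  # fraction of the 501-residue query covered
    identity: float  # percent identity


# Values as listed for the 14 candidate copies in the CYP2J region.
COPIES = [
    GeneCopy("GENE_1", 4764, 13967, 0.441, 60.4),
    GeneCopy("GENE_2", 35993, 77274, 0.972, 66.5),
    GeneCopy("GENE_3", 87921, 106913, 0.166, 59.4),
    GeneCopy("GENE_4", 126547, 126708, 0.108, 68.0),
    GeneCopy("GENE_5", 131666, 172308, 0.980, 63.2),
    GeneCopy("GENE_6", 181333, 220995, 0.441, 50.2),
    GeneCopy("GENE_7", 241542, 295291, 0.978, 63.4),
    GeneCopy("GENE_8", 302976, 378293, 0.451, 53.5),
    GeneCopy("GENE_9", 391323, 454110, 0.992, 59.9),
    GeneCopy("GENE_10", 462789, 462893, 0.070, 57.0),
    GeneCopy("GENE_11", 464757, 500175, 0.996, 67.2),
    GeneCopy("GENE_12", 515451, 538069, 0.986, 61.0),
    GeneCopy("GENE_13", 552820, 562733, 0.184, 62.7),
    GeneCopy("GENE_14", 576195, 588175, 0.547, 87.8),
]


def split_by_coverage(
    copies: List[GeneCopy], min_coverage: float = 0.9  # hypothetical cutoff
) -> Tuple[List[GeneCopy], List[GeneCopy]]:
    """Split candidates into near-full-length and partial copies."""
    full = [c for c in copies if c.coverage >= min_coverage]
    partial = [c for c in copies if c.coverage < min_coverage]
    return full, partial


full, partial = split_by_coverage(COPIES)
print("near full length:", [c.name for c in full])
print("partial:", [c.name for c in partial])
```

Deciding which partial copies are pseudogenic fragments and which are incomplete models of real genes still needs the manual or Ka/Ks-style checks described on the following slides.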
1. Similarity search in intergenic regions
Masking of known repeats and already predicted genes: 1.5-2 million fragments
BLASTX vs nr prot. db, E-value < 0.001: fragments with significant sequence similarity
Exclusion of transposon and virus derived sequence
Merging of fragments of the same element: regions containing independent elements
Closest known protein (first blast hit)
GENEWISE
Torrents, Suyama, Bork Genome Res. 13 (2003) 2550
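The filtering and merging steps of this pipeline can be sketched as follows. Only the E-value cutoff of 0.001 comes from the slide; the `Hit` record, the rule that fragments belong to the same element when they share the same best-hit protein, and the `max_gap` of 5000 bases are assumptions made for the illustration and may differ from the published procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    start: int     # genomic start of the fragment
    end: int       # genomic end of the fragment
    protein: str   # closest known protein (first BLAST hit)
    evalue: float


def merge_fragments(
    hits: List[Hit],
    max_evalue: float = 1e-3,  # cutoff stated on the slide
    max_gap: int = 5000,       # hypothetical merging distance
) -> List[Hit]:
    """Keep significant hits and merge nearby fragments of the same protein."""
    kept = sorted(
        (h for h in hits if h.evalue < max_evalue),
        key=lambda h: (h.protein, h.start),
    )
    merged: List[Hit] = []
    for h in kept:
        last = merged[-1] if merged else None
        if last and last.protein == h.protein and h.start - last.end <= max_gap:
            # Extend the current element instead of starting a new one.
            last.end = max(last.end, h.end)
            last.evalue = min(last.evalue, h.evalue)
        else:
            merged.append(Hit(h.start, h.end, h.protein, h.evalue))
    return merged


# Made-up hits for illustration only.
hits = [
    Hit(1000, 1300, "P1", 1e-10),
    Hit(1500, 1800, "P1", 1e-5),
    Hit(20000, 20400, "P1", 1e-8),
    Hit(3000, 3200, "P2", 0.5),  # not significant, dropped
]
elements = merge_fragments(hits)
for e in elements:
    print(e)
```

Each merged element would then be passed, together with its closest known protein, to a gene-structure aligner such as GENEWISE, as the slide indicates.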
Annotation of pseudogenes changes gene numbers
Ka/Ks functionality check
Ca. 20,000 detectable pseudogenes in each of human, mouse and rat
Still >3000 pseudogenes among the predicted human genes mid 2004 (build 34)
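A Ka/Ks functionality check compares the rate of non-synonymous substitutions (Ka) with the rate of synonymous substitutions (Ks): a ratio well below 1 suggests purifying selection on a functional gene, while a ratio near 1 is what neutral evolution of a pseudogene would give. The sketch below assumes Ka and Ks have already been estimated by some other tool; the function names and the 0.5 cutoff are hypothetical and not taken from the slides.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Ratio of non-synonymous to synonymous substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return ka / ks


def looks_functional(ka: float, ks: float, max_ratio: float = 0.5) -> bool:
    """True if the ratio is below a (hypothetical) purifying-selection cutoff."""
    return ka_ks_ratio(ka, ks) < max_ratio


# Made-up rate estimates for two candidate sequences.
print(looks_functional(0.05, 0.50))  # ratio 0.1
print(looks_functional(0.45, 0.50))  # ratio 0.9
```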
2. Consistency check of gene predictions
Annotation of pseudogenes changes gene numbers
[Figure: predicted gene Mm chr1:7608644-7681026 (80 kb) with exons e1, e2 and e3, e4, e5, e6; two processed pseudogenes, Genewise predictions using sptrembl|Q9HBM5 and SwissProt|RS2_RAT; marker: stop codon or frameshift]
Arrays, chips et al. 20% off?
What do we count?
20-40k genes
>100k transcripts
>1000k proteins?
[Figure labels: genes, Protein diversity]
Rate of detectable alternative splicing depends on EST coverage and library range
[Figure: %AS (y-axis 0 to 50) and AS per mRNA (second axis 2.0 to 2.8) against number of ESTs (0 to 3,500,000) for mouse and human]
Brett et al. Nature Genet. 30 (2002) 29
Boue et al. Bioessays 03
Homology-based predictions of exons and alternative transcripts (www.smart.embl-heidelberg.de)
SMART domain DB links to genomes
Top 10 domains* in human: 30% diff.!
Species | human | fly | worm
Immunoglobulin | 765 (381) | 140 | 64
C2H2 zinc finger | 706 (607) | 357 | 151
Protein kinase | 575 (501) | 319 | 437
Rhod.-like GPCR | 569 (616) | 97 | 358
P-loop NTPase | 433 | 198 | 183
Rev. transcriptase | 350 | 10 | 50
RRM (RNA-binding) | 300 (224) | 157 | 96
WD40 (G-protein) | 277 (136) | 162 | 102
Ankyrin repeat | 276 (145) | 105 | 107
Homeobox | 267 (160) | 148 | 109
Total no. of genes | 26500 (26500) | 13300 | 18200
*Only no. of genes given, no. of domains higher; note that only around 90% is sequenced
Nature 409 (01) 860; Science 291 (01) 1304
Metazoan genome annotation: an ongoing process and far from complete
>2000 pseudogenes in mammalian gene sets: only now are they about to be included in prediction pipelines
Ca. 150 retro-related genes in mammalian gene sets (>1000 in 2004), but true human genes sometimes suppressed
Annotation of gene clusters needs considerable improvement
Alternative splicing still a major unknown
Considerable human factor in annotation
I. Homology-based genome annotation
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Metazoan genome annotation: the dark side…
Homology detection and domain annotation
Human: Nature Feb 2001
Mouse: Nature Dec 2002
Mosquito: Science Oct 2002
Rat: Nature Apr 2004
Chicken: Nature Dec 2004
[Figure: phylogenetic tree of human, chimp, mouse, rat, chicken, fugu, D.mela., mosquito and C.eleg.; divergence-time labels: 5, 40, 75, 250MY, 310MY, 450MY, 600-1200MY?]
Chicken genome analysis
[Figure: labels 15% and 45%]
Zdobnov et al. Science 02
Hillier et al. Nature 04
Chicken genome analysis: orthology and cellular processes
75.4% identity (median) between chicken and human 1:1 orthologs
Immune response evolves fastest
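A median identity over 1:1 orthologs can be computed from pairwise alignments roughly as sketched below. The sequences are made-up toy alignments, and defining identity over aligned non-gap columns is an assumption; the slide does not say which definition was used for the 75.4% figure.

```python
from statistics import median
from typing import List, Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


# Toy aligned ortholog pairs (hypothetical sequences).
pairs: List[Tuple[str, str]] = [
    ("MKTAYIAK", "MKTAYIAK"),
    ("MKTA-IAK", "MKSAYIAR"),
    ("MKTAYIAK", "MRTAYLAK"),
]

identities = [percent_identity(a, b) for a, b in pairs]
median_identity = median(identities)
print(f"median identity: {median_identity:.1f}%")
```

Grouping the per-pair identities by functional category before taking the median would give the kind of comparison behind the statement that immune response genes evolve fastest.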
Chicken genome analysis: innovation and expansion of domain families
Orthology analysis reveals more subtle functional changes
Evolution by duplication: burst of an olfactory receptor family
…221 copies in chicken
…given ca. 300 ORs in chicken and 450 in human
…thought to recognize MHC diversity
[Figure labels: chicken, human]
Chicken genome analysis: evolution of function by domain accretion
Scavenger receptor cysteine-rich domain acquired by a fibrinogen-domain containing protein (identified and displayed by SMART)
I. Homology-based genome annotation
Metazoan proteome analysis: human vs chicken
Evolution of protein function
Metazoan genome annotation: the dark side…
Homology detection and domain annotation
Phylogenetic distribution of orthologs - Losses
Gene loss in diptera
Enzyme | D | A | P | Y | W | H | M
Sterol metabolism
Squalene monooxygenase (EC 1.14.99.7) | - | - | x | x | - | x | x
7-dehydrocholesterol reductase (EC 1.3.1.21) | - | - | x | x | x | x | x
Farnesyl-diphosphate farnesyltransferase (EC 2.5.1.21) | - | - | x | x | - | x | x
Lanosterol synthase (EC 5.4.99.7) | - | - | x | x | - | x | x
3-oxo-5-alpha-steroid 4-dehydrogenase 1 (EC 1.3.99.5) | - | - | x | - | x | x | x
C-5 sterol desaturase (EC 1.3.3.2), ergosterol biosynthesis | - | - | x | x | - | x | x
Cytochrome P450 P51, sterol 14-alpha demethylase | - | - | x | x | - | x | x
diminuto/24-dehydrocholesterol reductase ('seladin1') | - | - | x | - | x | x | x
Biosynthesis of NAD
Kynureninase (EC 3.7.1.3) | - | - | - | x | x | x | x
3-hydroxyanthranilate 3,4-dioxygenase (EC 1.13.11.6), synthesis of excitotoxin quinolinic acid | - | - | - | x | x | x | x
Quinolinate phosphoribosyltransferase (EC 2.4.2.19) | - | - | x | x | - | x | x
DNA-methylation and repair
DNA (cytosine-5)-methyltransferase 1 | - | - | x | - | - | x | x
uracil-DNA glycosylases | - | - | x | - | x | x | x
DNA-(apurinic or apyrimidinic site) lyase (EC 4.2.99.18) | - | - | - | x | x | - | -
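Presence/absence rows like the ones above can be screened for lineage-specific losses programmatically. The sketch below assumes that the first two columns, D and A, are the two dipteran species (the slide does not expand the column letters) and uses a hypothetical rule that an ortholog counts as lost in diptera when it is absent from both but present in at least three of the other five species.

```python
from typing import Dict, Tuple

SPECIES = "DAPYWHM"  # column order as on the slide

# Three rows copied from the table; "x" = ortholog present, "-" = absent.
PATTERNS: Dict[str, str] = {
    "Squalene monooxygenase": "--xx-xx",
    "Kynureninase": "---xxxx",
    "DNA-(apurinic or apyrimidinic site) lyase": "---xx--",
}


def lost_in(
    pattern: str,
    lost: Tuple[str, ...] = ("D", "A"),  # assumed dipteran columns
    min_present_elsewhere: int = 3,       # hypothetical support threshold
) -> bool:
    """True if absent in all `lost` species but present in enough others."""
    presence = dict(zip(SPECIES, pattern))
    absent = all(presence[s] == "-" for s in lost)
    elsewhere = sum(
        1 for s, value in presence.items() if s not in lost and value == "x"
    )
    return absent and elsewhere >= min_present_elsewhere


candidates = [name for name, pattern in PATTERNS.items() if lost_in(pattern)]
print(candidates)
```

With these settings the lyase row is not reported, because it is also missing from most of the remaining species and so does not look like a diptera-specific loss.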
Functional changes at evolutionary time scales
Orthologs mapped onto metazoan phylogeny
Summary (homology-based function prediction)
Emphasis in homology-based genome annotation shifts from sensitivity (e.g. domain identification) to selectivity issues (orthology assignment for 1:1 function transfer)
Metazoan genome annotation is far from complete, and caution is needed when using incomplete and partially erroneous parts lists (e.g. when predicting networks)
Yet, with the incoming number of metazoan genomes, our understanding of functional diversification at the protein level will increase dramatically... although the proteome remains far from being deciphered