Bioinformatics of Protein Domains: New Computational Approach for
Maricel Kann. Feb-08
Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains
Maricel Kann, Assistant Professor
University of Maryland, Baltimore [email protected]
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATATCTTCCTCGAAGGCTAATCGATAACTGACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTAATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATATATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCACGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAACAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAA
GAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTACTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAACCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATACTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAA
The Human Genome Project
© Sidney Harris
Protein Classification
A L I G N M E N T
A L I G G N M E N
QUERY
Set of related sequences or protein family from database
A L I G N M E N T
A L I G G N M E N
4 3 4 7 1 2 -2 0 0
Alignment Algorithm
Scoring Function
Accurate Statistics
A L I G - N M E N T
A L I G G N M E N -
score=19
PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).
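As a small worked check of the scoring step: the nine per-column scores printed on the slide add up to the displayed score of 19. The sketch below assumes that the total score is simply the sum of the per-column substitution scores; the function name is illustrative and not from the talk.

```python
# Per-column substitution scores as printed on the slide.
column_scores = [4, 3, 4, 7, 1, 2, -2, 0, 0]


def alignment_score(scores):
    """Total alignment score, assumed to be the sum of the column scores."""
    return sum(scores)


total = alignment_score(column_scores)
print(total)
```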
Significance of a score
Estimated number of non-related sequences in the database that score higher than the query:
$E = p(S_Q < S_R)\,D$
D = size of database
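[Figure: distribution of random alignment scores; x-axis: alignment score S, with the query score S_Q marked; y-axis: number of alignments with score S]
$p(S_Q < S_R) = 1 - \exp\left[-K M N e^{-\lambda S_Q}\right]$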
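A minimal sketch of how the p-value and E-value on these two slides could be evaluated numerically. It assumes the usual Karlin-Altschul reading of the symbols (K and λ are the extreme-value parameters, M and N the lengths of the two compared sequences). The numeric values of λ and K below are illustrative placeholders, not values from the talk.

```python
import math


def random_score_pvalue(score, m, n, lam, k):
    """p(S_Q < S_R) = 1 - exp(-K*M*N*exp(-lambda*S_Q))."""
    expected_hits = k * m * n * math.exp(-lam * score)
    # -expm1(-x) is a numerically stable way to compute 1 - exp(-x).
    return -math.expm1(-expected_hits)


def evalue(pvalue, database_size):
    """E = p * D, with D the size of the database."""
    return pvalue * database_size


# Illustrative parameters (assumptions, chosen only for the example).
LAM, K = 0.318, 0.134
p = random_score_pvalue(score=60, m=250, n=300, lam=LAM, k=K)
e = evalue(p, database_size=10_000)
print(p, e)
```

Higher query scores give smaller p-values, and the E-value scales linearly with the database size.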
Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
– Introduction
  • Definition of protein domain
  • Main features of the Conserved Domain Database (CDD)
  • Position-specific scoring matrices (PSSM)
  • Classification of alignment methods
– Current methods for protein domain searches
– Our approach (Global Blocks Aligned Locally)
– Results
The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.
Conserved Domains
• In 1974 Michael Rossmann recognized the NADH binding domain in several dehydrogenases (named after him).
• Conserved domains are determined by comparative sequence analysis.
• Molecular evolution uses such domains as building blocks.
• They may be recombined in different arrangements to make proteins with different functions.
• Most proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of combinations of domains.
CDD: a collection of domain multiple alignments linked to protein 3D structure
Marchler-Bauer et al (2003) NAR 383:387
heme-binding site
It combines information about protein sequences, their conservation patterns across evolution and the protein structure, and provides useful functional annotation.
Protein Classification
QUERY
Set of related sequences or protein family from database
Alignment Algorithm
Scoring Function
Accurate Statistics
PSSM can be derived from the MSA
A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment.
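To make the definition concrete, here is a minimal sketch of deriving a PSSM from a multiple sequence alignment as per-column log-odds scores. The uniform background frequency, the pseudocount of 1 and the toy alignment are simplifying assumptions for illustration. Production tools such as PSI-BLAST use sequence weighting and substitution-matrix-based pseudocounts instead.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*msa):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment against the PSSM, column by column."""
    return sum(col[res] for col, res in zip(pssm, segment))


# Toy alignment (hypothetical sequences).
msa = ["ALIGN", "ALIGN", "AMIGN", "ALVGN"]
pssm = build_pssm(msa)
print(score_segment(pssm, "ALIGN"), score_segment(pssm, "WWWWW"))
```

The same residue can receive different scores at different positions, which is what separates a PSSM from a single substitution matrix such as PAM or BLOSUM.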
MSA contains conserved blocks
Protein Structure Alignment
[Figure labels: α-helix, β-strand, loops; red sequence, blue sequence]
Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.
Protein Sequence Conservation Occurs in Blocks with Intervening Gaps
CDD representation
[Figure: CDD footprint; labels: gap, gap, 1, 2]
Sequence-PSSM alignment
A L I G N M E N T
Sequence-PSSM alignment
[Figure: query aligned against a PSSM with three blocks; labels: Gaps in Query, Gaps in PSSM]
Three Types of Sequence Alignments
Semi-Global Alignment: Subsequence Onto Sequence
Local Alignment: Subsequence To Subsequence
Global Alignment: Sequence To Sequence
BW Erickson & P Sellers (1983) Time Warps, String Edits, and Macromolecules, p. 55
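The three alignment types differ only in which unaligned ends are free of charge. The sketch below shows this with a single dynamic-programming routine. The scoring scheme (match +1, mismatch -1, linear gap -1) and the function name are assumptions made for illustration, not parameters from the talk.

```python
def alignment_score(pattern, sequence, mode, match=1, mismatch=-1, gap=-1):
    """Best score of aligning `pattern` to `sequence`.

    mode = "global":      sequence to sequence
    mode = "semi-global": whole pattern onto a subsequence of `sequence`
    mode = "local":       subsequence to subsequence
    """
    if mode not in ("global", "semi-global", "local"):
        raise ValueError(f"unknown mode: {mode}")
    n, m = len(pattern), len(sequence)
    # Row 0: only global alignment pays for skipping a prefix of `sequence`.
    prev = [j * gap if mode == "global" else 0 for j in range(m + 1)]
    best_local = 0
    for i in range(1, n + 1):
        # Column 0: only local alignment may skip a prefix of `pattern` freely.
        cur = [0 if mode == "local" else i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if pattern[i - 1] == sequence[j - 1] else mismatch
            value = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if mode == "local":
                value = max(value, 0)  # a local alignment may restart anywhere
                best_local = max(best_local, value)
            cur[j] = value
        prev = cur
    if mode == "global":
        return prev[m]       # both sequences consumed completely
    if mode == "semi-global":
        return max(prev)     # pattern consumed; trailing sequence is free
    return best_local


for mode in ("global", "semi-global", "local"):
    print(mode, alignment_score("XLIGNX", "ALIGNMENT", mode))
```

With the pattern "XLIGNX", the local mode reports only the matching core "LIGN", whereas the semi-global mode also charges for the two flanking residues because the complete pattern has to be placed.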
Semi-global Alignment
• Semi-global alignment, which finds a complete domain within the query, is the natural choice in the context of protein structure, function and evolution.
[Figure label: query sequence]
Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
– Introduction
– Current methods for protein domain searches
  • RPS-BLAST
  • HMMer
  • SALTO
– (Global Blocks Aligned Locally)
– Derivation of Statistics
– Results
Reverse Position-Specific BLAST (RPS-BLAST)
[Figure: query aligned against a PSSM with three blocks]
rpsBLAST doesn’t incorporate the concept of “block”
A Schaffer et al (1999) Bioinformatics 15:1000
The role of the PSSM has changed from being the “query” in PSI-BLAST to “subject”, hence the term “reverse” in RPS-BLAST
(Reversed-Position Specific)
HMM
HMMer is trained on the CDD sequences.
HMMer does not specifically incorporate the concept of “block”.
HMMer’s Statistics are a (Poor) Empirical Fit
• HMMer fits the extreme value distribution (EVD) parameters λ and K to simulated sequences with a Gaussian length distribution.
• The HMMer semi-global Gumbel E-value approximation is sometimes very inaccurate.
ftp://ftp.genetics.wustl.edu.pub/eddy/hmmer_CURRENT/Userguide.pdf
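To illustrate what an empirical EVD fit involves, the sketch below draws scores from a Gumbel distribution and recovers its parameters by the method of moments. This is a stand-in chosen for brevity and is not HMMer's actual fitting procedure. The "true" parameters and sample size are arbitrary assumptions. In Karlin-Altschul notation the location μ corresponds to ln(KMN)/λ.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def sample_gumbel(rng, mu, lam):
    """Draw one score from a Gumbel (EVD) with location mu and scale 1/lam."""
    u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep both logarithms finite
    return mu - math.log(-math.log(u)) / lam


def fit_gumbel_moments(scores):
    """Method-of-moments estimates: sd = pi/(lam*sqrt(6)), mean = mu + gamma/lam."""
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = statistics.fmean(scores) - EULER_GAMMA / lam
    return mu, lam


def gumbel_pvalue(score, mu, lam):
    """P(S >= score) = 1 - exp(-exp(-lam*(score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


rng = random.Random(0)
TRUE_MU, TRUE_LAM = 20.0, 0.3  # arbitrary illustrative values
scores = [sample_gumbel(rng, TRUE_MU, TRUE_LAM) for _ in range(20000)]
mu_hat, lam_hat = fit_gumbel_moments(scores)
print(mu_hat, lam_hat)
```

A fit of this kind is dominated by the bulk of the simulated scores, while database searches depend on the far tail, which is one way small parameter errors can turn into large errors in very small p-values.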
SALTO: Structure-based ALignment TOol
[Figure: SALTO and G-SALTO alignments of blocks separated by gaps]
Kann MG et al. Bioinformatics, 21(8):1451-6. (2005)
Properties of an Ideal Alignment Method
• Semi-global alignment is intrinsically the right tool for searching for domains within proteins.
  – Local alignment methods match only a portion of a domain against a query.
    • Reverse Position-Specific BLAST (rpsBLAST)
• Screening a database for matches needs to be fast.
  • HMMs have no intrinsic heuristics to speed computation.
  • The word heuristics in rpsBLAST speed screening and are available for any local alignment method.
• Accurate Statistics.
GLOBAL (Global Blocks Aligned Locally)
A semi-global Alignment Method for Querying A Database of Protein Domains with Accurate Statistics
M G. Kann et al (2007) NAR, 35(14):4678-4685.
Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
  • Algorithm and scoring scheme
  • Derivation of Statistics
– Results
GLOBAL: aligns blocks locally
[Figure: G-SALTO and GLOBAL alignments of blocks separated by gaps]
GLOBAL
[Figure: QUERY versus PSSM alignment matrix]
Global Algorithm:
• Uses dynamic programming (DP) to find the alignment of a protein query sequence to all blocks of the PSSM (in order).
• Penalty=0 both for unaligned regions of the PSSM at the ends of the blocks and unaligned regions of the queries between blocks.
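The two bullets above can be sketched as a chained dynamic program. This is a simplified illustration under stated assumptions, not the published GLOBAL implementation: blocks are aligned without internal gaps, a block may stay completely unaligned at score 0, and the toy PSSM columns (match +2, any other residue -1) are hypothetical. All function names are illustrative.

```python
NEG_INF = float("-inf")


def block_from_string(residues, match=2):
    """Toy PSSM block: one {residue: score} column per character (hypothetical)."""
    return [{res: match} for res in residues]


def blocks_aligned_locally(query, blocks, default=-1):
    """Best total score of aligning the ordered blocks to the query.

    Unaligned block ends and query residues between blocks cost nothing.
    """
    n = len(query)
    # prefix[j]: best score of the blocks handled so far within query[:j].
    prefix = [0] * (n + 1)
    for block in blocks:
        best = list(prefix)            # block left unaligned: score unchanged
        prev_col = [NEG_INF] * (n + 1)
        for column in block:
            cur_col = [NEG_INF] * (n + 1)
            for j in range(1, n + 1):
                s = column.get(query[j - 1], default)
                # Start the block here (after the earlier blocks) or extend
                # the ungapped diagonal from the previous column.
                cur_col[j] = s + max(prefix[j - 1], prev_col[j - 1])
                best[j] = max(best[j], cur_col[j])  # may end at any column
            prev_col = cur_col
        for j in range(1, n + 1):
            best[j] = max(best[j], best[j - 1])     # trailing query is free
        prefix = best
    return prefix[n]


blocks = [block_from_string("ALI"), block_from_string("MEN")]
print(blocks_aligned_locally("XXALIGGGMENTXX", blocks))
print(blocks_aligned_locally("LIGGME", blocks))
```

Because each block starts from the running prefix score of the previous blocks, the blocks cannot overlap or swap order in the query, while the residues between them are skipped at no cost.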
GLOBAL: statistics for b blocks
[Figure: QUERY versus PSSM alignment matrix]
L_eff = effective length
Assuming the scores of the blocks are independent of each other, GLOBAL estimates the total alignment p-value with a convolution algorithm.
For b blocks, the total alignment score T is:
$T = \sum_{i=1}^{b} \hat{M}_i(L_{\mathrm{eff}})$
$L_{\mathrm{eff}} = \left[ \frac{(n+b-1)!}{(n-1)!\,b!} \right]^{1/b}$
n = size of query
b = number of blocks in the PSSM
e.g., n=160, b=3, L_eff=89
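The worked example on this slide can be checked numerically, and the convolution step can be sketched for discrete score distributions. The effective-length formula is taken from the slide. The per-block distributions below are toy placeholders, since the slides do not give them, and the function names are illustrative.

```python
import math


def effective_length(n, b):
    """L_eff = [ (n+b-1)! / ((n-1)! b!) ]^(1/b), i.e. C(n+b-1, b)^(1/b)."""
    return math.comb(n + b - 1, b) ** (1.0 / b)


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent integer scores."""
    out = {}
    for score_a, prob_a in dist_a.items():
        for score_b, prob_b in dist_b.items():
            total = score_a + score_b
            out[total] = out.get(total, 0.0) + prob_a * prob_b
    return out


def total_score_pvalue(block_distributions, observed_total):
    """P(T >= observed_total) for T = sum of independent block scores."""
    total = {0: 1.0}
    for dist in block_distributions:
        total = convolve(total, dist)
    return sum(p for score, p in total.items() if score >= observed_total)


print(round(effective_length(160, 3)))  # the slide reports L_eff = 89

# Toy per-block score distributions (hypothetical numbers).
toy_blocks = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
print(total_score_pvalue(toy_blocks, 2))
```

For n=160 and b=3 the binomial coefficient is 695,520, whose cube root is about 88.6, consistent with the rounded value of 89 quoted on the slide.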
Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
– Results
  • Benchmarking database
  • ROC (L-ROC) curves
  • P-value Accuracy
Benchmarking test set
Database of queries: ~10,000 sequences with known structure (from the MMDB database).
To count as a true relationship to a CDD entry, a query sequence needs to be a structure neighbor (using VAST) of a protein in the CD for which the structure is known.
The resulting test set has >300 families with almost 30,000 known true positives.
Benchmarking test set
[Figure: two histograms of the number of true positives versus the percentage of sequence identity (VAST)]
ROC curve for GLOBAL
[Figure: ROC curves for GLOBAL, HMMer-semi-global, HMMer-local and RPS-BLAST; x-axis: fraction of false positives (0.00 to 0.05); y-axis: fraction of true positives (0.00 to 0.40)]

Method             *LROC10000  LROC50000  LROC200000
GLOBAL             0.181       0.224      0.313
HMMer semiglobal   0.185       0.224      0.299
HMMer local        0.169       0.194      0.239
rpsBLAST           0.168       0.192      0.229

*LROC: Swensson RG: Med Phys 1996, 23(10):1709-25
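For readers unfamiliar with how such curves are built, the sketch below computes plain ROC points (fraction of false positives versus fraction of true positives) from a ranked list of hits. It is a generic illustration with made-up scores and labels. It does not implement the LROC measure cited above, whose exact definition is not given in the slides.

```python
def roc_points(hits):
    """hits: iterable of (score, is_true_positive) pairs.

    Returns one (false-positive fraction, true-positive fraction) point per
    hit, walking down the list from the best score to the worst.
    """
    hits = list(hits)
    total_tp = sum(1 for _, is_tp in hits if is_tp)
    total_fp = len(hits) - total_tp
    points = []
    tp = fp = 0
    for _, is_tp in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((
            fp / total_fp if total_fp else 0.0,
            tp / total_tp if total_tp else 0.0,
        ))
    return points


# Hypothetical ranked hits: (alignment score, known true relationship?).
hits = [(9.0, True), (7.5, True), (6.0, False), (5.0, True), (1.0, False)]
points = roc_points(hits)
print(points)
```

A method whose curve rises faster at small false-positive fractions places more true relationships ahead of the first false ones, which is the region shown in the plot.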
P-value accuracy
[Figure: two panels (GLOBAL, HMMer) for Cd00030, Cd00083 and Cd00288; x-axis: true P-value (1.E-07 to 1.E+00); y-axis: estimated P-value / true P-value (0.1 to 1000)]
1,000,000 simulations using random sequences of length 350
Conclusions
• The GLOBAL algorithm and its p-value provide a flexible format for semi-global sequence alignments.
• GLOBAL respects block structure but adds flexibility at the ends of each block.
• The GLOBAL p-value is based on local alignment p-values, so BLAST heuristics from local alignment apply to GLOBAL.
• Overall performance is similar to that of HMMer semi-global, but GLOBAL has more accurate statistics, and heuristics similar to those used in local methods could make it orders of magnitude faster.
Future work
• Implementation of GLOBAL:
  – “Blockalizer”: creates blocks within the MSA.
  – Heuristics to increase the speed.
• Optimization of domain discovery: Can we mix and match methods/CDs?
Acknowledgments
SALTO:
• Stephen Altschul, Anna Panchenko, Paul Thiessen and Steve Bryant.
GLOBAL:
• John Spouge, Sergey Sheetlin and Yonil Park.
PROTEIN INTERACTIONS:
Predicting protein-protein interaction by searching evolutionary tree automorphism space
• Teresa Przytycka and Raja Jothi.
Predicting protein domain interactions from co-evolution of conserved regions:
• Teresa Przytycka, Praveen Cherukuri and Raja Jothi.
• UMBC Computational Biology lab team.
Kann’s Computational Biology lab.