Bioinformatics of Protein Domains: New Computational Approach for

Maricel Kann. Feb-08. Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains. Maricel Kann, Assistant Professor, University of Maryland, Baltimore County [email protected]

Transcript of Bioinformatics of Protein Domains: New Computational Approach for

Page 1: Bioinformatics of Protein Domains: New Computational Approach for

Maricel Kann. Feb-08

Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains

Maricel Kann, Assistant Professor

University of Maryland, Baltimore County [email protected]

Page 2: Bioinformatics of Protein Domains: New Computational Approach for

GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATATCTTCCTCGAAGGCTAATCGATAACTGACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTAATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATATATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCACGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAACAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAA
GAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTACTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAACCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATACTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAA

The Human Genome Project

Page 3: Bioinformatics of Protein Domains: New Computational Approach for

© Sidney Harris

Page 4: Bioinformatics of Protein Domains: New Computational Approach for

Protein Classification

QUERY
Set of related sequences or protein family from database

Alignment Algorithm
Scoring Function
Accurate Statistics

A L I G N M E N T
A L I G G N M E N
4 3 4 7 1 2 -2 0 0

A L I G - N M E N T
A L I G G N M E N -
score=19

PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).

Page 5: Bioinformatics of Protein Domains: New Computational Approach for

Significance of a score

Estimated number of non-related sequences in the database that score higher than the query:

$E = p(S_Q < S_R) \, D$

D = size of database

Page 6: Bioinformatics of Protein Domains: New Computational Approach for

[Figure: distribution of alignments' scores. x-axis: score S, with the query score S_Q marked; y-axis: # of alignments with score S; curve labeled "random scores".]

$p(S_Q < S_R) = 1 - \exp\left[-K M N e^{-\lambda S_Q}\right]$
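The significance formulas on these two slides can be tried numerically. What follows is a minimal sketch, not code from the talk. It reads the p-value formula as p(S_Q < S_R) = 1 - exp[-K M N e^{-λ S_Q}] and takes M and N to be the lengths of the two compared sequences, as in the usual Karlin-Altschul setting. The function names and all parameter values (K, λ, M, N, D) are made up for illustration.

```python
import math


def tail_pvalue(s_q, K, lam, m, n):
    """p(S_Q < S_R) = 1 - exp(-K*M*N*exp(-lambda*S_Q))."""
    expected_hits = K * m * n * math.exp(-lam * s_q)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


def e_value(p, database_size):
    """E = p(S_Q < S_R) * D, with D the size of the database."""
    return p * database_size


# Hypothetical parameters, chosen only to show the trend.
K, lam = 0.04, 0.27
m, n = 160, 300
D = 10_000

for score in (30, 50, 70):
    p = tail_pvalue(score, K, lam, m, n)
    print(score, p, e_value(p, D))
```

Higher query scores give smaller p-values, and the E-value scales that probability by the database size.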

Page 7: Bioinformatics of Protein Domains: New Computational Approach for

Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
  • Definition of protein domain
  • Main features of the Conserved Domain Database (CDD)
  • Position-specific scoring matrices (PSSM)
  • Classification of alignment methods
– Current methods for protein domain searches
– Our approach (Global Blocks Aligned Locally)
– Results

Page 8: Bioinformatics of Protein Domains: New Computational Approach for

The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.

Page 9: Bioinformatics of Protein Domains: New Computational Approach for

Conserved Domains

• In 1974 Michael Rossmann recognized the NADH-binding domain in several dehydrogenases (the Rossmann fold, named after him).
• Conserved domains are determined by comparative sequence analysis.
• Molecular evolution uses such domains as building blocks.
• They may be recombined in different arrangements to make proteins with different functions.
• Most proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of domain combinations.

Page 10: Bioinformatics of Protein Domains: New Computational Approach for

CDD: a collection of domain multiple alignments linked to protein 3D structure

Page 11: Bioinformatics of Protein Domains: New Computational Approach for

Marchler-Bauer et al. (2003) NAR 383:387

heme-binding site

CDD combines information about protein sequences, their conservation patterns across evolution, and protein structure, and provides useful functional annotation.

Page 12: Bioinformatics of Protein Domains: New Computational Approach for

Protein Classification

QUERY
Set of related sequences or protein family from database

Alignment Algorithm
Scoring Function
Accurate Statistics

PSSM can be derived from the MSA

A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment.
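To make the definition concrete, here is a minimal sketch of deriving a PSSM from a toy multiple sequence alignment. It is only an illustration, not how CDD builds its PSSMs. The uniform background frequencies, the pseudocount of 1, the toy sequences and the function names `build_pssm` and `score_segment` are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*msa):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Ungapped score of a segment aligned column by column to the PSSM."""
    return sum(col[res] for col, res in zip(pssm, segment))


msa = ["ALIGN", "ALIGG", "AVIGN", "ALLGN"]  # toy alignment, equal lengths
pssm = build_pssm(msa)
print(round(pssm[0]["A"], 2), round(pssm[0]["W"], 2))
print(score_segment(pssm, "ALIGN") > score_segment(pssm, "WWWWW"))
```

Each column gets its own score table, so a residue conserved at one position scores well there but not necessarily elsewhere.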

Page 13: Bioinformatics of Protein Domains: New Computational Approach for

MSA contains conserved blocks

Page 14: Bioinformatics of Protein Domains: New Computational Approach for

Protein Structure Alignment

[Figure labels: α-helix, β-strand, loops; red sequence, blue sequence]

Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.

Protein Sequence Conservation Occurs in Blocks with Intervening Gaps

Page 15: Bioinformatics of Protein Domains: New Computational Approach for

CDD representation

[Figure labels: gap, gap; 1, 2; CDD footprint]

Page 16: Bioinformatics of Protein Domains: New Computational Approach for

Sequence-PSSM alignment

A L I G N M E N T

Page 17: Bioinformatics of Protein Domains: New Computational Approach for

Sequence-PSSM alignment

[Figure: alignment path through three blocks. Axes: query, PSSM. Labels: Gaps in Query, Gaps in PSSM.]

Page 18: Bioinformatics of Protein Domains: New Computational Approach for

Three Types of Sequence Alignments

• Semi-Global Alignment: Subsequence Onto Sequence
• Local Alignment: Subsequence To Subsequence
• Global Alignment: Sequence To Sequence

BW Erickson & P Sellers (1983) Time Warps, String Edits, and Macromolecules, p. 55
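The three alignment types differ only in where the dynamic programming is allowed to start and stop. The sketch below shows this with a simple match/mismatch/linear-gap scheme; it is not taken from the talk or from the cited book. In "semiglobal" mode the whole first sequence (for example a domain) is aligned onto a stretch of the second (for example a query), which is one reading of "subsequence onto sequence". The function name `align_score` and the scoring values are hypothetical.

```python
def align_score(a, b, mode, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a (rows) against b (columns).

    mode "global":     all of a against all of b
    mode "semiglobal": all of a onto any stretch of b (free ends in b)
    mode "local":      any part of a against any part of b
    """
    rows, cols = len(a), len(b)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        # Leaving part of a unaligned is free only in local mode.
        H[i][0] = 0 if mode == "local" else i * gap
    for j in range(1, cols + 1):
        # Leading residues of b are charged only in global mode.
        H[0][j] = j * gap if mode == "global" else 0
    best = 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if mode == "local":
                h = max(h, 0)          # a local alignment may restart anywhere
                best = max(best, h)    # and may end anywhere
            H[i][j] = h
    if mode == "global":
        return H[rows][cols]
    if mode == "semiglobal":
        return max(H[rows])            # trailing residues of b are free
    return best


for mode in ("global", "semiglobal", "local"):
    print(mode, align_score("GNMEWW", "ALIGNMENT", mode))
```

With these toy inputs the local score ignores the unmatched "WW" tail, while the semi-global score has to pay for it because the whole first sequence must be aligned.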

Page 19: Bioinformatics of Protein Domains: New Computational Approach for

Semi-global Alignment

• Finding a complete domain in the query calls for semi-global alignment, the natural choice in the context of protein structure, function and evolution.

[Figure label: query sequence]

Page 20: Bioinformatics of Protein Domains: New Computational Approach for

Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
  • RPS-BLAST
  • HMMer
  • SALTO
– (Global Blocks Aligned Locally)
– Derivation of Statistics
– Results

Page 21: Bioinformatics of Protein Domains: New Computational Approach for

Reverse Position-Specific BLAST (RPS-BLAST)

[Figure: alignment path through three blocks. Axes: query, PSSM.]

rpsBLAST doesn’t incorporate the concept of “block”.

A Schaffer et al. (1999) Bioinformatics 15:1000

The role of the PSSM has changed from being the “query” in PSI-BLAST to being the “subject”, hence the term “reverse” in RPS-BLAST (Reverse Position-Specific).

Page 22: Bioinformatics of Protein Domains: New Computational Approach for

HMM

HMMer is trained on the CDD sequences.

HMMer does not specifically incorporate the concept of “block”.

Page 23: Bioinformatics of Protein Domains: New Computational Approach for

HMMer’s Statistics are a (Poor) Empirical Fit

• HMMer fits the extreme value distribution (EVD) parameters λ and K to simulated sequences with a Gaussian length distribution.
• The HMMer semi-global Gumbel E-value approximation is sometimes very inaccurate.

ftp://ftp.genetics.wustl.edu.pub/eddy/hmmer_CURRENT/Userguide.pdf

Page 24: Bioinformatics of Protein Domains: New Computational Approach for

SALTO

Structure-based ALignment TOol

[Figure labels: gap, gap; SALTO; G-SALTO]

Kann MG et al. Bioinformatics, 21(8):1451-6 (2005)

Page 25: Bioinformatics of Protein Domains: New Computational Approach for

Properties of an Ideal Alignment Method

• A semi-global alignment method is intrinsically the right tool for searching for domains within proteins.
  – Local alignment methods match only a portion of a domain against a query.
    • Reverse Position-Specific BLAST (rpsBLAST)
• Screening a database for matches needs to be fast.
  – HMMs have no intrinsic heuristics to speed computation.
  – The word heuristics in rpsBLAST speed screening and are available for any local alignment method.
• Accurate statistics.

Page 26: Bioinformatics of Protein Domains: New Computational Approach for

GLOBAL (Global Blocks Aligned Locally)

A Semi-global Alignment Method for Querying a Database of Protein Domains with Accurate Statistics

M. G. Kann et al. (2007) NAR, 35(14):4678-4685.

Page 27: Bioinformatics of Protein Domains: New Computational Approach for

Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
  • Algorithm and scoring scheme
  • Derivation of Statistics
– Results

Page 28: Bioinformatics of Protein Domains: New Computational Approach for

GLOBAL: aligns blocks locally

[Figure labels: gap, gap; G-SALTO; GLOBAL]

Page 29: Bioinformatics of Protein Domains: New Computational Approach for

GLOBAL

[Figure axes: QUERY, PSSM]

GLOBAL algorithm:
• Uses dynamic programming (DP) to find the alignment of a protein query sequence to all blocks of the PSSM (in order).
• Penalty = 0 both for unaligned regions of the PSSM at the ends of the blocks and for unaligned regions of the query between blocks.
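The two rules on this slide can be sketched as a small dynamic program. This is an illustrative reading, not the published GLOBAL implementation. It assumes that matching inside a block is ungapped, that every block contributes at least one aligned column, and that a block is a list of per-column score dictionaries. The function name `global_blocks_score`, the `default` mismatch score and the toy blocks are all hypothetical.

```python
def global_blocks_score(query, blocks, default=-1):
    """Best total score for placing every block, in order, on the query.

    blocks: list of blocks; each block is a list of PSSM columns and each
    column is a dict mapping residue -> score (toy format, an assumption).
    """
    n = len(query)
    NEG = float("-inf")
    # prev[j]: best total for the blocks handled so far, using only query[:j].
    prev = [0] * (n + 1)
    for block in blocks:
        cur = [NEG] * (n + 1)
        # diag[j]: best score whose last pair is (previous column, query[j-1]).
        diag = [NEG] * (n + 1)
        for column in block:
            new_diag = [NEG] * (n + 1)
            for j in range(1, n + 1):
                s = column.get(query[j - 1], default)
                # Extend the ungapped run inside this block, or start the
                # block at this column for free (unaligned block start and
                # unaligned query residues cost nothing).
                new_diag[j] = s + max(diag[j - 1], prev[j - 1])
                # The block may also stop at any column for free.
                cur[j] = max(cur[j], new_diag[j])
            diag = new_diag
        for j in range(1, n + 1):
            # Query residues after the block are free as well.
            cur[j] = max(cur[j], cur[j - 1])
        prev = cur
    return prev[n]


w_block = [{"W": 5}, {"W": 5}]   # toy block that likes "WW"
c_block = [{"C": 5}, {"C": 5}]   # toy block that likes "CC"
print(global_blocks_score("WWAACC", [w_block, c_block]))
print(global_blocks_score("CCAAWW", [w_block, c_block]))
```

The second call scores lower because the blocks must appear in the query in the same order as in the PSSM.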

Page 30: Bioinformatics of Protein Domains: New Computational Approach for

GLOBAL: statistics for b blocks

[Figure axes: QUERY, PSSM; L_eff = effective length]

Assuming the scores of the blocks are independent of each other, GLOBAL estimates the total alignment p-value with a convolution algorithm.

For b blocks, the total alignment score T is:

$T = \sum_{i=1}^{b} \hat{M}_i(L_{\mathrm{eff}})$

$L_{\mathrm{eff}} = \left[ \frac{(n+b-1)!}{(n-1)!\, b!} \right]^{1/b}$

n = size of query
b = number of blocks in the PSSM

e.g., n=160, b=3, Leff=89
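The worked example can be checked directly, reading the effective-length formula as L_eff = [(n+b-1)! / ((n-1)! b!)]^{1/b}. The convolution step is sketched under stated assumptions: block scores are taken to be non-negative integers, and each block's score distribution is given as a list of probabilities indexed by score. That input format and the function names are hypothetical; the slide does not say how the per-block distributions are obtained.

```python
import math


def effective_length(n, b):
    """L_eff = [(n+b-1)! / ((n-1)! b!)]**(1/b), a binomial coefficient."""
    return math.comb(n + b - 1, b) ** (1.0 / b)


def convolve(p, q):
    """Distribution of the sum of two independent integer scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def total_pvalue(block_pmfs, threshold):
    """P(T >= threshold), where T is the sum of independent block scores."""
    total = [1.0]  # distribution of an empty sum: score 0 with probability 1
    for pmf in block_pmfs:
        total = convolve(total, pmf)
    return sum(total[threshold:])


print(round(effective_length(160, 3)))  # slide example: n=160, b=3

# Toy distributions: each block scores 0 or 1 with equal probability.
print(total_pvalue([[0.5, 0.5], [0.5, 0.5]], threshold=2))
```

The convolution is valid only under the independence assumption stated on the slide.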

Page 31: Bioinformatics of Protein Domains: New Computational Approach for

Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
– Results
  • Benchmarking database
  • ROC (L-ROC) curves
  • P-value Accuracy

Page 32: Bioinformatics of Protein Domains: New Computational Approach for

Benchmarking test set

Database of queries: ~10,000 sequences with known structure (from the MMDB database).

To count as a true relationship to a CDD entry, a query sequence needs to be a structure neighbor (using VAST) of a protein in the CD for which the structure is known.

The resulting test set has >300 families with almost 30,000 known true positives.

Page 33: Bioinformatics of Protein Domains: New Computational Approach for

Benchmarking test set

[Figure: two histograms of the number of true positives by percentage of sequence identity (VAST). x-axis: 0 to 100; y-axes: 0 to 2000 and 0 to 18000.]

Page 34: Bioinformatics of Protein Domains: New Computational Approach for

ROC curve for GLOBAL

[Figure: ROC curves for GLOBAL, HMMer-semi-global, HMMer-local and RPS-BLAST. x-axis: fraction of false positives (0.00 to 0.05); y-axis: fraction of true positives (0.00 to 0.40).]

                   *LROC10000   LROC50000   LROC200000
GLOBAL             0.181        0.224       0.313
HMMer semiglobal   0.185        0.224       0.299
HMMer local        0.169        0.194       0.239
rpsBLAST           0.168        0.192       0.229

*LROC: Swensson RG: Med Phys 1996, 23(10):1709-25

Page 35: Bioinformatics of Protein Domains: New Computational Approach for

P-value accuracy

[Figure: two panels, GLOBAL and HMMer, for Cd00030, Cd00083 and Cd00288. x-axis: True P-value (1.E-07 to 1.E+00); y-axis: Estimated P-value / True P-value (0.1 to 1000).]

1,000,000 simulations using random sequences of length 350
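The accuracy plots compare an estimated p-value with a "true" p-value obtained by simulation on random sequences. The sketch below shows one way such an empirical p-value and the plotted ratio could be computed. It is not the procedure used for the talk. The uniform residue frequencies, the much smaller number of trials, the toy score function and the function names are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def empirical_pvalue(score_fn, threshold, length=350, trials=10_000, seed=0):
    """Fraction of random sequences whose score reaches the threshold."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        # Uniform residue frequencies are an assumption of this sketch.
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if score_fn(seq) >= threshold:
            hits += 1
    return hits / trials


def accuracy_ratio(estimated_p, true_p):
    """Quantity on the y-axis of the plots: estimated / true p-value."""
    return estimated_p / true_p


def toy_score(seq):
    """Hypothetical stand-in for a real alignment score: count of W."""
    return seq.count("W")


true_p = empirical_pvalue(toy_score, threshold=25, trials=2_000)
print(true_p)
```

A ratio near 1 means the estimate is accurate. Resolving p-values as small as 1.E-06 by simulation needs on the order of a million trials, which matches the number of simulations quoted on the slide.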

Page 36: Bioinformatics of Protein Domains: New Computational Approach for

Conclusions

• The GLOBAL algorithm and p-value provide a flexible format for semi-global sequence alignments.
• GLOBAL respects block structure but adds flexibility at the ends of each block.
• The GLOBAL p-value is based on local alignment p-values. BLAST heuristics from local alignment therefore apply to GLOBAL.
• While the overall performance is similar to that of HMMer semi-global, GLOBAL has more accurate statistics, and the possibility of implementing heuristics similar to those used in local methods could make it orders of magnitude faster.

Page 37: Bioinformatics of Protein Domains: New Computational Approach for

Future work

• Implementation of GLOBAL:
  – “Blockalizer”: creates blocks within the MSA.
  – Heuristics to increase the speed.
• Optimization of domain discovery: Can we mix and match methods/CDs?

Page 38: Bioinformatics of Protein Domains: New Computational Approach for

Acknowledgments

SALTO:
• Stephen Altschul, Anna Panchenko, Paul Thiessen and Steve Bryant.

GLOBAL:
• John Spouge, Sergey Sheetlin and Yonil Park.

PROTEIN INTERACTIONS:
Predicting protein-protein interaction by searching evolutionary tree automorphism space
• Teresa Przytycka and Raja Jothi.
Predicting protein domain interactions from co-evolution of conserved regions
• Teresa Przytycka, Praveen Cherukuri and Raja Jothi.

• UMBC Computational Biology lab team.

Page 39: Bioinformatics of Protein Domains: New Computational Approach for

Kann’s Computational Biology lab.