Predictive Methods Using Protein Sequences Unit 23 BIOL221T: Advanced Bioinformatics for...

Predictive Methods Using Protein Sequences

Unit 23

BIOL221T: Advanced Bioinformatics for Biotechnology

Irene Gabashvili, PhD

Introduction

Each protein starts its life as a shapeless string of amino acids (more exactly, residues)

Primary > Secondary > 3D Structure

Function depends on 3D structure

3D structure can be "guessed" from sequence, but more info is needed: folding environment, chaperonins, etc.

Partial structural predictions can also be helpful

Amino Acid versus Residue

[Figure: structural formulas of a free amino acid (H2N-CHR-COOH) and of a residue within a chain (-NH-CHR-CO-)]

--- next lecture: all on structures ---

From previous lecture:

Often, it is enough to know the sequence of the first 6 amino acids to identify the protein

"Terminal sequence identification" approach: a "label" ("backpack") is chemically attached to the end.

label-AA1-AA2- … -AAn

label-AA1

label-AA1-AA2-AA3

…

Riptide Algorithm (D. Carter et al)

Terminal Amino Acid Sequence Prediction, e.g. MQIFVK

Riptide Sequencing Algorithm

[Figure: mass spectrum of the labeled terminal fragments. X-axis: mass/charge, m/z (amu), 200 to 1200; y-axis: occurrence counts, 0 to 6000. Peaks are annotated b1+, a2+, a3+, b4+, b5+ and label, label+M, label+MQ, label+MQI, label+MQIF, label+MQIFV, label+MQIFVK; the spacing between adjacent peaks corresponds to one residue mass (e.g. the mass of F). Inset: chemical structure of a labeled peptide, P(L/I)SDQ...]

Mass spec data from an unknown protein (e.g. the protein shown is ubiquitin, whose amino acid sequence starts with MQIFVK…). See e.g. >gi|37571|emb|CAA44911.1| ubiquitin [Homo sapiens]
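
The label "ladder" idea can be sketched in a few lines of Python: the mass difference between successive labeled fragments is matched against known residue masses. This is only an illustration, not the Riptide implementation. The residue masses are standard monoisotopic values for the six residues of the example; `LABEL_MASS`, the tolerance and the synthetic peak list are assumptions made for this sketch.

```python
# Monoisotopic residue masses (Da) for the residues in the MQIFVK example only.
# Note: leucine has the same mass as isoleucine, so L and I cannot be told apart.
RESIDUE_MASS = {
    "M": 131.04049, "Q": 128.05858, "I": 113.08406,
    "F": 147.06841, "V": 99.06841, "K": 128.09496,
}

def read_ladder(peaks, tol=0.01):
    """Read a sequence from sorted masses of label, label+AA1, label+AA1+AA2, ..."""
    sequence = []
    for lower, upper in zip(peaks, peaks[1:]):
        delta = upper - lower
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        # Report "?" when the mass difference is ambiguous or unknown.
        sequence.append(matches[0] if len(matches) == 1 else "?")
    return "".join(sequence)

LABEL_MASS = 200.0  # hypothetical label mass, chosen only for illustration

# Build a synthetic, noise-free ladder for the ubiquitin start MQIFVK.
peaks = [LABEL_MASS]
for aa in "MQIFVK":
    peaks.append(peaks[-1] + RESIDUE_MASS[aa])

print(read_ladder(peaks))
```

Real spectra are noisy and contain several ion types, which is why a scoring approach over candidate sequences is used instead of reading differences directly.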

Simple algorithm

do x = 0 to 2
  do y = 0 to 2
    do z = 0 to 2
      Pxyz = MS(mx) + MS(mx+my) + MS(mx+my+mz)

Calculates a score value for each of the 20^3 amino acid sequences in a nested-loop fashion, where MS(m) is the mass-spectrum signal (occurrence counts) at mass m. Suppose there are only 3 amino acids (0, 1 and 2) with masses m0, m1 and m2 (3^3 = 27 ordered sequences) and no attached label. For the x-y-z sequence, the scoring function rewards the presence of the likely molecular fragments x, x-y and x-y-z.

The sequence generating the highest scoring Pxyz is reported as the most likely sequence for the unknown protein.
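
A minimal Python sketch of this nested-loop scoring, under stated assumptions: the spectrum is a dictionary from mass to counts, `ms()` sums counts within a tolerance, and the three masses and the toy spectrum are invented for illustration (they are not data from the lecture).

```python
from itertools import product

def ms(spectrum, mass, tol=0.5):
    """MS(m): total counts of all peaks within tol of the given mass."""
    return sum(count for m, count in spectrum.items() if abs(m - mass) <= tol)

def best_sequence(spectrum, masses, length=3):
    """Score every ordered sequence; Pxyz = MS(mx) + MS(mx+my) + MS(mx+my+mz)."""
    best, best_score = None, float("-inf")
    for seq in product(range(len(masses)), repeat=length):
        prefix_mass, score = 0.0, 0.0
        for residue in seq:
            prefix_mass += masses[residue]       # mass of the fragment so far
            score += ms(spectrum, prefix_mass)   # reward if that fragment is seen
        if score > best_score:
            best, best_score = seq, score
    return best, best_score

# Toy data (assumed): three "amino acids" 0, 1, 2 and a small spectrum.
masses = [57.0, 71.0, 87.0]
spectrum = {57.0: 1, 71.0: 10, 128.0: 8, 215.0: 6}

print(best_sequence(spectrum, masses))
```

With this toy spectrum the fragments at 71, 128 and 215 support the sequence 1-0-2. The loop visits all 27 ordered sequences, which is the cost the following slides reduce.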

Redundant lookups

[Diagram: scoring tree for the 3-residue example. Level 1: P0 = MS(m0), P1 = MS(m1), P2 = MS(m2). Level 2: Pxy = Px + MS(mx+my), for P00 ... P22. Level 3: Pxyz = Pxy + MS(mx+my+mz), for P000 ... P222. MAX over the 27 leaves gives the Highest Scoring Sequence.]

All highlighted MS() calls are equal due to the commutativity of addition (e.g. MS(m0+m1+m2) = MS(m0+m2+m1) = MS(m1+m0+m2), and so on).


Need combinations

[Diagram: the same scoring tree, with the six orderings of residues 0, 1 and 2 merged into a single combination node C012. P01, P02, P10, P12, P20 and P21 feed one MAX, followed by a single lookup +MS(m0+m1+m2).]

Cxyz = MAX(Pxyz, Pxzy, Pyxz, Pyzx, Pzxy, Pzyx)

Highest Scoring Combination

MS() calls collapsed from 6 to 1.
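
To see how much is saved, one can count lookups in sequence space versus combination space. The sketch below assumes one MS() lookup per tree node in the ordered case and one per distinct multiset of residues in the combination case; the function name is made up for this example.

```python
from math import comb

def lookup_counts(n_residues, length):
    """Return (ordered, combined) MS() lookup counts for fragments up to `length`."""
    # Ordered sequences: n^k nodes at depth k of the scoring tree.
    ordered = sum(n_residues ** k for k in range(1, length + 1))
    # Combinations with repetition: C(n + k - 1, k) distinct fragment masses.
    combined = sum(comb(n_residues + k - 1, k) for k in range(1, length + 1))
    return ordered, combined

print(lookup_counts(3, 3))    # the 3-residue toy example
print(lookup_counts(20, 6))   # 20 amino acids, first 6 residues
```

For the toy example this gives 39 lookups in the tree against 19 distinct masses; the gap widens quickly for 20 amino acids and longer prefixes.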


Riptide "combination space" sequencing

[Diagram: scoring lattice over combinations (unordered sets of residues). Level 1: C0 = MS(m0), C1 = MS(m1), C2 = MS(m2). Level 2: C00, C01, C02, C11, C12, C22, e.g. C00 = C0 + MS(m0+m0) and C01 = MAX(C0, C1) + MS(m0+m1). Level 3: C000, C001, C002, C011, C012, C022, C111, C112, C122, C222, e.g. C012 = MAX(C01, C02, C12) + MS(m0+m1+m2). A final MAX selects the highest scoring combination.]

Highest scoring ordered sequence is easily derived from combination scores.
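
The lattice can be sketched as a small dynamic program: each combination's score is one MS() lookup plus the best score among the combinations obtained by removing one residue, and remembering that choice lets the ordered sequence be read back. This is an illustrative reconstruction from the diagram, not the published Riptide code; `ms()`, the masses and the toy spectrum are assumptions.

```python
from itertools import combinations_with_replacement

def ms(spectrum, mass, tol=0.5):
    """MS(m): total counts of all peaks within tol of the given mass."""
    return sum(count for m, count in spectrum.items() if abs(m - mass) <= tol)

def combination_scores(spectrum, masses, length):
    """C(combo) = MS(total mass) + max over C(combo minus one residue)."""
    scores, parent = {(): 0.0}, {}
    for k in range(1, length + 1):
        for combo in combinations_with_replacement(range(len(masses)), k):
            lookup = ms(spectrum, sum(masses[r] for r in combo))  # one lookup only
            subs = (combo[:i] + combo[i + 1:] for i in range(k))
            best_sub = max(subs, key=lambda sub: scores[sub])
            scores[combo] = scores[best_sub] + lookup
            parent[combo] = best_sub
    return scores, parent

def best_ordered_sequence(spectrum, masses, length):
    """Pick the best combination, then backtrack to recover the residue order."""
    scores, parent = combination_scores(spectrum, masses, length)
    combo = max((c for c in scores if len(c) == length), key=scores.get)
    total, reversed_seq = scores[combo], []
    while combo:
        sub = parent[combo]
        # The residue whose count dropped was the last one added.
        reversed_seq.append(next(r for r in combo if combo.count(r) != sub.count(r)))
        combo = sub
    return tuple(reversed(reversed_seq)), total

masses = [57.0, 71.0, 87.0]                          # toy masses (assumed)
spectrum = {57.0: 1, 71.0: 10, 128.0: 8, 215.0: 6}   # toy spectrum (assumed)
print(best_ordered_sequence(spectrum, masses, 3))
```

On the toy data this returns the same sequence and score as exhaustive scoring of all ordered sequences, while looking up each distinct fragment mass once.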


Crash course on biostatistics

Statistics: analyzing data sets in terms of the relationships between the individual points

Variance & Standard Deviation; Covariance

Machine Learning approaches (supervised & unsupervised)

Clustering vs Classification, PCA

P-values & E-values, Scores via False positives, negatives
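
As a quick reminder of the first items, here is a small sketch of sample variance, standard deviation and covariance in plain Python. It assumes the sample (n - 1) convention; the numbers are made up.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance: average product of deviations, using n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)

def std_dev(xs):
    return sqrt(variance(xs))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up measurements
ys = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly 2 * xs, so strongly co-varying

print(variance(xs), std_dev(xs), covariance(xs, ys))
```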

PCA

Principal components analysis (PCA) is a technique that can be used to simplify a dataset

It is a linear transformation that chooses a new coordinate system for the data set such that
  the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),
  the second greatest variance on the second axis, and so on.

PCA can be used for reducing dimensionality by eliminating the later principal components.

Applications: face recognition, pattern finding
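
For two variables the principal components can be written in closed form from the 2x2 covariance matrix, which makes a compact sketch possible without any libraries. The function name and the sample points are assumptions for illustration; real analyses would use a linear-algebra library and more dimensions.

```python
from math import atan2, cos, sin, sqrt

def pca_2d(points):
    """Return (first principal axis, (variance on PC1, variance on PC2), PC1 scores)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    root = sqrt(((a - c) / 2) ** 2 + b ** 2)
    variances = ((a + c) / 2 + root, (a + c) / 2 - root)
    # Direction of greatest variance.
    theta = 0.5 * atan2(2 * b, a - c)
    pc1 = (cos(theta), sin(theta))
    # Projecting onto PC1 reduces each 2D point to a single coordinate.
    scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
    return pc1, variances, scores

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # points on the line y = x
pc1, variances, scores = pca_2d(points)
print(pc1, variances)
```

Because these points lie on a line, all of the variance falls on the first component and the second can be dropped without losing information.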

What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to the objects in the same cluster (intraclass similarity)
  Dissimilar to the objects in other clusters (interclass dissimilarity)

Cluster analysis
  Statistical method for grouping a set of data objects into clusters
  A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity

Clustering is unsupervised classification

Can be used as a stand-alone tool or as a preprocessing step for other algorithms

Group objects according to their similarity

Cluster: a set of objects that are similar to each other and separated from the other objects.

Example: green/red data points were generated from two different normal distributions

K-Means Clustering

The meaning of 'K-means'
  Why it is called 'K-means' clustering: K points are used to represent the clustering result; each point corresponds to the centre (mean) of a cluster
  Each point is assigned to the cluster with the closest center point
  The number K must be specified

Basic algorithm

K-means clustering

[Figure: five scatter-plot panels (both axes 0 to 10) showing successive steps of K-means with K=2]

Arbitrarily choose K objects as initial cluster centers

Assign each object to the most similar center

Update the cluster means

Reassign

Update the cluster means

Reassign
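
The steps above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance, takes the first K points as the arbitrary initial centers, and stops when assignments no longer change; the sample points are made up.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Return (centers, assignment) where assignment[i] is the cluster of points[i]."""
    centers = [tuple(p) for p in points[:k]]  # arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing was reassigned, so the clustering is stable
        assignment = new_assignment
        # Update the cluster means.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # two obvious groups
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

Even with both initial centers in the same group, a couple of reassign/update rounds separate the two groups here. In general the result depends on the initial centers, so K-means is often run several times.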

Hierarchical clustering

Protocol

1. Calculate pairwise distance matrix

2. Find the two most similar genes or clusters

3. Merge the two selected clusters to produce a new cluster

4. Calculate pairwise distance matrix involving the new cluster

5. Repeat steps 2-4 until all objects are in one cluster

6. The clustering sequence is represented by a hierarchical tree (dendrogram).

EXAMPLE

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
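
The protocol can be run on the example matrix with a short sketch. The slide does not say how the distance to a merged cluster is computed, so single linkage (minimum distance between members) is assumed here, and ties are broken by taking the first pair found.

```python
def hierarchical_clustering(labels, dist):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(labels))]  # start with one object per cluster
    merges = []

    def linkage(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(dist[i][j] for i in a for j in b)

    def name(cluster):
        return "".join(labels[i] for i in cluster)

    while len(clusters) > 1:
        # Steps 2 and 4: find the two most similar clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Step 3: merge them into a new cluster.
        merges.append((name(clusters[i]), name(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

labels = ["A", "B", "C", "D", "E"]
dist = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
]
for merge in hierarchical_clustering(labels, dist):
    print(merge)
```

The sequence of merges, read from first to last, is the dendrogram from the leaves up to the root. Other linkage rules (complete, average) can give a different tree for the same matrix.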

(M)ANOVA

The One-Way Analysis of Variance (ANOVA) technique takes a set of grouped data and determines whether the mean of a variable differs significantly between groups

Often there are multiple variables and you are interested in determining whether the entire set of means is different from one group to the next
  There is a multivariate version of analysis of variance that can address that problem (MANOVA)
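
A minimal sketch of the one-way ANOVA test statistic: the F ratio compares the variance between group means with the variance within groups. The function name and the three small groups are made up; turning F into a p-value requires the F distribution, which is left to a statistics package.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # made-up measurements
print(one_way_anova_f(groups))
```

A large F means the group means differ by more than the within-group scatter would explain.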

What is a Biomarker?

Biomarkers are biological molecules that are indicators of physiologic state and also change during a disease process. The utility of a biomarker lies in its ability to provide an early indication of a disease, to monitor disease progression, to provide ease of detection, and to provide a factor measurable across populations.

NCI: (Srinivas and Srivastava, Cancer Biomarker Research Group, review article, Vol. 8, 1160-69, 2002)

What Types of Biomarkers Exist?

• Genomic
  • DNA (e.g., BRCA-I gene mutations)
  • RNA (gene expression, up/down regulation)

• Proteomic
  • Peptides (e.g., PIF)
  • Proteins (e.g., HER2/neu, PSA, CA-125)

• Metabonomic
  • Small molecules, metabolites (e.g., glucose, cholesterol, cortisol)

MS-based

Bioinformatics tools can predict:

Secondary Structure
3D Structure
Interaction Sites
Solvent Accessibility
Transmembrane Segments
Subcellular Localization
Function

What Can Be Predicted?

O-Glycosylation Sites
Phosphorylation Sites
Protease Cut Sites
Nuclear Targeting Sites
Mitochondrial Target Sites
Chloroplast Target Sites
Signal Sequences
Signal Sequence Cleavage
Peroxisome Target Sites
ER Targeting Sites
Transmembrane Sites
Tyrosine Sulfation Sites
GPInositol Anchor Sites
PEST sites
Coiled-Coil Sites
T-Cell/MHC Epitopes
Protein Lifetime
And a lot more…

Sequence Feature Servers

T-Cell Epitope Prediction (no longer exists): http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm

B cell epitope prediction from 3D structures: http://www.cbs.dtu.dk/services/DiscoTope/

Predictions of promiscuous MHCI-restricted epitopes: http://immunax.dfci.harvard.edu/PEPVAC/

O-Glycosylation Prediction: http://www.cbs.dtu.dk/services/NetOGlyc/

Phosphorylation Prediction: http://www.cbs.dtu.dk/services/NetPhos/

Secondary Structure

PHDsec: http://www.predictprotein.org

Workbench: http://workbench.sdsc.edu/

NGBW: www.ngbw.org

PROFsec: http://cubic.bioc.columbia.edu/predictprotein/

PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/)

Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/)

History of secondary structure prediction:

The 1st generation: physicochemical principles, expert rules, and statistics (1970s, 50% accuracy)

The 2nd generation methods: a sliding window that walked through the entire sequence (1980s into the 1990s, ~60% accuracy).

The 3rd generation methods use multiple sequence alignments and take advantage of the evolutionary information (~75% accuracy).

Tutorials/Description:

PredictProtein: sequence analysis, prediction of protein function and structure

The PredictProtein Server. Nucleic Acids Research 32(Web Server issue):W321-W326.

Interaction sites

http://cubic.bioc.columbia.edu/services/

See also: http://bioinformatics.ca/links_directory/narweb2007/ (same for 2006-2003)

http://gemdock.life.nctu.edu.tw/3D-partner/vers1/index.php (predicts interaction partners)

http://ef-site.hgc.jp/eF-seek/index.jsp

Solvent Accessibility

PHDacc (http://www.predictprotein.org/)

PROFacc (http://cubic.bioc.columbia.edu/predictprotein/)

Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/)

Transmembrane Segments

TopPred (http://bioweb.pasteur.fr/seqanal/interfaces/toppred.html)

TMHMM (http://www.cbs.dtu.dk/services/TMHMM/)

Membrane Helix Prediction: http://www.cbs.dtu.dk/services/TMHMM-2.0/

Subcellular Localization

PSORT: http://psort.ims.u-tokyo.ac.jp/

TargetP: http://www.cbs.dtu.dk/services/TargetP/

http://cubic.bioc.columbia.edu/db/LOC3d/index.html
