Protein Sequence Motifs

www.bioinformatics.nl


Aalt-Jan van Dijk

Plant Research International, Wageningen UR

Biometris, Wageningen


Plant Bioinformatics

Genomics
• Next Generation Sequencing
• Genome assembly & annotation
• (Comparative) genome analysis
• SNP analysis, marker development

Technology
• Computational infrastructure
• Database development
• Web-based analysis tools
• Software development
• Workflow management systems
• Machine learning

Integrated analysis of omics datasets

Transcriptomics
• Alternative splicing
• EST analysis

Proteomics
• Data (pre-)processing pipelining
• Alternative splicing
• Protein interaction networks

Metabolomics
• Database development
• Data (pre-)processing pipelining
• Metabolite and pathway identification

Systems biology
• Network modelling (bottom-up)
• Protein interaction networks


My research

Protein complex structures
• Protein-protein docking
• Correlated mutations

Interaction site prediction/analysis
• Protein-protein interactions
• Protein-DNA interactions

Motif search
• Enzyme active sites


Overview

• Protein Motif Searching
• Hydrophobicity & Transmembrane Domains
• Protein Interactions
• Sequence motifs to predict interaction sites
• Secondary Structure Prediction


Protein Motif Searching


What is a motif?

A motif is a description of a particular element of a protein that contains a specific sequence pattern

Motifs are identified by
• 3D structural alignment
• Multiple sequence alignment
• Pattern searching programs


Strict consensus pattern: use only strictly conserved residues

C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C

C     C   P     C    (conserved residues)
CxxxxxCxxxPxxxxxC    (strict consensus pattern)

But what about variable residues? Gaps?

Strict consensus patterns contain
• no alternative residues
• no flexible regions
• no mismatches
• no gaps
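The derivation of the strict consensus from the three aligned sequences can be sketched in a few lines of Python. This is only an illustration: the function name `strict_consensus` and its parameters are made up here, and the rule it applies (keep a residue only where every sequence has the same non-gap character) is the one stated on the slide.

```python
def strict_consensus(aligned_seqs, gap="-", wildcard="x"):
    """Keep a residue where all aligned sequences agree; otherwise emit a wildcard."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    pattern = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        # A column is strictly conserved only if it holds one residue type and no gap.
        if len(residues) == 1 and gap not in residues:
            pattern.append(column[0])
        else:
            pattern.append(wildcard)
    return "".join(pattern)


alignment = [
    "C--QASCDGIPLKMNDC",
    "C---VTCEGLPMRMDQC",
    "CERTLGCQPMPVH---C",
]
print(strict_consensus(alignment))  # CxxxxxCxxxPxxxxxC
```

Because every non-conserved column collapses to a wildcard, the result cannot express alternatives such as [GP] or variable-length stretches.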



Most motifs are defined as regular expressions

Motifs can contain
• alternative residues
• flexible regions

C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C

  CXXXCXGXPXXXXXC
  |   | | |     |
FGCAKLCAGFPLRRLPCFYG


The PROSITE Syntax

A-[BC]-X-D(2,5)-{EFG}-H

A         A
[BC]      B or C
X         anything
D(2,5)    2-5 D's
{EFG}     not E, F, or G
H         H
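Since each PROSITE element maps onto a regular-expression construct, a pattern can be translated mechanically. The sketch below covers only the elements listed above (single residues, x, [..], {..} and repeat counts); the function name `prosite_to_regex` is hypothetical, and other parts of the PROSITE syntax such as the terminus anchors are deliberately not handled.

```python
import re

# One PROSITE element: a residue, x, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (2) or (2,5).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression string."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # anything except the listed residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        regex.append(core)
    return "".join(regex)


regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
hit = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(regex)        # C.{2,5}C.[GP].P.{2,5}C
print(hit.group())  # CAKLCAGFPLRRLPC
```

Note that x is written as X on this slide; the sketch treats both spellings as "any residue".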


PROSITE entries

Mandatory motifs characterise a protein (super-)family

ID   SUBTILASE_ASP; PATTERN.
DE   Serine proteases, subtilase family, aspartic acid active site.
PA   [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].

ID   SUBTILASE_HIS; PATTERN.
DE   Serine proteases, subtilase family, histidine active site.
PA   H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].

ID   SUBTILASE_SER; PATTERN.
DE   Serine proteases, subtilase family, serine active site.
PA   G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].


Exercise

Find the three subtilase motifs in PROSITE (prosite.expasy.org)

Compare the lists of proteins in which the motifs occur – what does this tell you?

Similarly, compare protein structures in which the motifs occur

Have a look at the “sequence logo”



Some motifs occur frequently in proteins, but a pattern match does not mean the site is actually functional, e.g. post-translational modification sites

ID   ASN_GLYCOSYLATION; PATTERN.
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
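A short pattern like this matches in many places, and a scan only yields candidate sites. As a minimal sketch (the function name `find_sequons` and the test sequence are invented for illustration), the pattern N-{P}-[ST]-{P} can be written as a regular expression; a lookahead is used so that overlapping candidates are all reported.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead (?=...) consumes no
# characters, so overlapping matches are found as well.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sequons(seq):
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


# Hypothetical sequence, used only to exercise the pattern.
print(find_sequons("MNGTANPSKNNSSA"))
# [(2, 'NGTA'), (10, 'NNSS'), (11, 'NSSA')]
```

The N at position 6 is skipped because it is followed by P. Whether any of the reported positions is really glycosylated cannot be decided from the pattern alone.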


Exercise

Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/

Input: your favorite set of sequences

Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?



Profiles

Many motifs cannot be easily defined using simple patterns

Such motifs can be defined using profiles

A profile is constructed from a multiple sequence alignment. For each position, each amino acid is given a score depending on how likely it is to occur


Calculating a Profile

For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix

An (extremely simple) example:

seq_01  A A A A A A A A A A W
seq_02  A A A A A A A A A W W
seq_03  A A A A A A A A W W W
seq_04  A A A A A A A W W W W
seq_05  A A A A A A W W W W W
seq_06  A A A A A W W W W W W
seq_07  A A A A W W W W W W W
seq_08  A A A W W W W W W W W
seq_09  A A W W W W W W W W W
seq_10  A W W W W W W W W W W

Profile scores for three alignment columns, from prophecy (EMBOSS), using Henikoff profile type and the BLOSUM62 matrix:

            A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
10A:      4.0   0.0  -2.0  -1.0  -2.0   0.0  -2.0  -1.0  -1.0  -1.0  -1.0  -2.0  -1.0  -1.0  -1.0   1.0   0.0   0.0  -3.0  -2.0
5A+5W:    1.0  -2.0  -6.0  -4.0  -1.0  -2.0  -4.0  -4.0  -4.0  -3.0  -2.0  -6.0  -5.0  -3.0  -4.0  -2.0  -2.0  -3.0   8.0   0.0
10W:     -3.0  -2.0  -4.0  -3.0   1.0  -2.0  -2.0  -3.0  -3.0  -2.0  -1.0  -4.0  -4.0  -2.0  -3.0  -3.0  -2.0  -3.0  11.0   2.0

Excerpt from the EBLOSUM62 matrix:

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
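The "average of the appropriate rows" can be sketched as follows, using only the A and W rows of the BLOSUM62 excerpt. This is a plain frequency-weighted average and not a reimplementation of prophecy: the name `column_profile` is hypothetical, and the prophecy row for the 5A+5W column equals the sum of the A and W rows (twice this average), so prophecy evidently scales its output differently.

```python
from collections import Counter

# Column order and rows copied from the BLOSUM62 excerpt (only A and W are needed here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}


def column_profile(column):
    """Average the matrix rows of the residues in one alignment column,
    weighting each row by the frequency of its residue in that column."""
    counts = Counter(column)
    total = len(column)
    return {
        aa: sum(n / total * BLOSUM62_ROWS[res][j] for res, n in counts.items())
        for j, aa in enumerate(AMINO_ACIDS)
    }


all_a = column_profile("AAAAAAAAAA")   # column 1 of the example alignment
mixed = column_profile("AAAAAWWWWW")   # column 6: five A and five W
print(all_a["A"], all_a["W"])          # 4.0 -3.0
print(mixed["A"], mixed["W"])          # 0.5 4.0
```

A column of ten A simply reproduces the A row, which agrees with the 10A line of the prophecy output.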


Pattern Searching

Short linear motifs: e.g. http://dilimot.russelllab.org/

Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi


Exercise

Take a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME


Hydropathy Plot

Prediction of hydrophobic and hydrophilic regions in a protein


Partition Coefficients

[Figure: partitioning between water (hydrophilic) and oil (hydrophobic)]


Hydrophobicity/Hydrophilicity Values

     Fauchere & Pliska   Kyte & Doolittle   Hopp & Woods   Eisenberg
R         -1.37               -4.50              3.00         -2.53
K         -1.35               -3.90              3.00         -1.50
D         -1.05               -3.50              3.00         -0.90
Q         -0.78               -3.50              0.20         -0.85
N         -0.85               -3.50              0.20         -0.78
E         -0.87               -3.50              3.00         -0.74
H         -0.40               -3.20             -0.50         -0.40
S         -0.18               -0.80              0.30         -0.18
T         -0.05               -0.70             -0.40         -0.05
P          0.12               -1.60              0.00          0.12
Y          0.26               -1.30             -2.30          0.26
C          0.29                2.50             -1.00          0.29
G          0.48               -0.40              0.00          0.48
A          0.62                1.80             -0.50          0.62
M          0.64                1.90             -1.30          0.64
W          0.81               -0.90             -3.40          0.81
L          1.06                3.80             -1.80          1.06
V          1.08                4.20             -1.50          1.08
F          1.19                2.80             -2.50          1.19
I          1.38                4.50             -1.80          1.38

Residues are ordered from hydrophilic (top) to hydrophobic (bottom)


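Hydrophobicity Plot

• Sum the amino acid hydrophobicity values in a given window of 2k+1 residues and divide by the window length:

\[ H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n \]

  (on the right, H_n is the hydrophobicity value of residue n; on the left, H_i is the average for the window centred on residue i)
• Plot the value in the middle of the window
• Shift the window one position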


Sliding Window Approach

Calculate property for first sub-sequence

Use the result (plot/print/store)

Move to next residue position, and repeat
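The three steps above can be sketched as a small Python function. The scale is the Kyte & Doolittle column of the table of hydrophobicity values; the function name `hydropathy_profile` is hypothetical, and the value used for Z (Glu or Gln, which occurs in the example sequence MEZCALTASTESVERYNICE) is an assumption, chosen because E and Q both have -3.5 on this scale.

```python
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5, "H": -3.2,
    "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5, "G": -0.4, "A": 1.8,
    "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2, "F": 2.8, "I": 4.5,
    "Z": -3.5,  # assumption: Z (Glu or Gln) gets the value shared by E and Q
}


def hydropathy_profile(seq, window=9, scale=KYTE_DOOLITTLE):
    """Return (1-based centre position, window average) for every full window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    k = window // 2
    values = [scale[residue] for residue in seq]
    profile = []
    for i in range(k, len(seq) - k):
        segment = values[i - k:i + k + 1]             # residues i-k .. i+k
        profile.append((i + 1, sum(segment) / window))
    return profile


for position, value in hydropathy_profile("MEZCALTASTESVERYNICE", window=5):
    print(position, round(value, 2))
```

Positions closer than k residues to either end have no full window and are left out, which is one common choice; padding or shrinking the window at the ends would be alternatives.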


Hydrophobicity Plot

MEZCALTASTESVERYNICE

[Figure: hydrophobicity plot for the sequence MEZCALTASTESVERYNICE; x-axis: residue position (0 to 21), y-axis: hydrophobicity (-2 to 1.5)]


Transmembrane Regions

In an alpha helix:
• Rotation is 100 degrees per amino acid residue
• Climb is 1.5 Angstrom per amino acid residue

The membrane is approx. 30 Angstrom thick, so we need approx. 30 / 1.5 = 20 amino acids to span the membrane
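Written out as a worked calculation (the symbols d, h and N are introduced here only for illustration, and the number of helical turns is an added consequence of the 100 degrees per residue quoted above):

```latex
\[
N = \frac{d}{h} = \frac{30\ \text{\AA}}{1.5\ \text{\AA per residue}} = 20\ \text{residues},
\qquad
\text{turns} = \frac{N \times 100^{\circ}}{360^{\circ}} = \frac{2000^{\circ}}{360^{\circ}} \approx 5.6
\]
```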


Adapting the window size to the size of the membrane-spanning segment makes the picture easier to interpret

[Figure: hydrophobicity plots with window = 1, window = 9, window = 19, and window = 121]


Protein Interactions



Obligatory: hemoglobin

Transient: mitochondrial Cu transporters



Experimental approaches (1)

Yeast two-hybrid (Y2H)

Experimental approaches (2)

Affinity Purification + mass spectrometry (AP-MS)


Interaction Databases

• STRING http://string.embl.de/
• HPRD http://www.hprd.org/
• InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
• Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc


Yeast protein interaction network


Sequence-based Protein Binding Site Prediction


Binding site


Predefined motifs


Motif search in groups of proteins

• Group proteins which have the same interaction partner
• Use motif search, e.g. find PWMs

Neduva, PLoS Biol 2005

Correlated Motif Search


Interactors        Non-interactors
AARLL  PLTEQ       AARLL  MARLT
MARLT  DLTEP       VVRLM  MARLT
VVRLM  MMTER       PLTEQ  DLTEP

Correlated Motif Pair: (RL,TE)



Van Dijk et al, Plos Comp Biol 2010
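To make the idea of a correlated motif pair concrete, the toy example with interactors and non-interactors can be scored by brute force. This is only an illustrative sketch, not the published algorithm: the function names and the scoring rule (pairs covered among interactors minus pairs covered among non-interactors) are assumptions made here.

```python
from itertools import combinations

# Sequence pairs from the toy example.
INTERACTORS = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
NON_INTERACTORS = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]


def covers(motif_a, motif_b, pair):
    """True if one protein of the pair contains motif_a and the other motif_b."""
    p, q = pair
    return (motif_a in p and motif_b in q) or (motif_a in q and motif_b in p)


def support(motif_a, motif_b, pairs):
    """Number of protein pairs covered by the motif pair."""
    return sum(covers(motif_a, motif_b, pair) for pair in pairs)


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_motif_pair(positives, negatives, k=2):
    """Try all pairs of distinct k-mers and keep the best-scoring one."""
    candidates = sorted({m for pair in positives + negatives
                         for seq in pair for m in kmers(seq, k)})
    best, best_score = None, float("-inf")
    for a, b in combinations(candidates, 2):
        score = support(a, b, positives) - support(a, b, negatives)
        if score > best_score:
            best, best_score = (a, b), score
    return best, best_score


print(best_motif_pair(INTERACTORS, NON_INTERACTORS))  # (('RL', 'TE'), 3)
```

On real interaction networks the number of candidate motif pairs makes exhaustive enumeration like this impractical, which is why faster search strategies are needed.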

Experimental validation

New approach: slider

• Faster approach for genome-wide searching for interaction motifs
• Improve the mining algorithm with a priori biological knowledge (conservation score, surface accessibility)

Boyen et al, IEEE/ACM Trans Comput Biol Bioinform. 2011


THE END….. Questions?



Traditional methods (statistical and/or rule-based)

• E.g. Garnier, Osguthorpe & Robson (GOR): statistical method
• Accuracy ~ 60%


GOR Helix Parameters

      i-8       i-6       i-4       i-2         i       i+2       i+4       i+6       i+8
Gly    -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
Ala     5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
Val     0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
Leu     0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
Ile     5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
Ser     0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
Thr     0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
Asp     0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
Glu     0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
Asn     0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
Gln     0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
Lys    20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
His    10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
Arg     0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
Phe     0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
Tyr    -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
Trp   -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
Cys     0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
Met    10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
Pro   -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0

I S G A R N I E R H E L I X P R E D I C T
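How the helix parameters are applied to a sequence can be sketched as follows. The values are copied from the table of GOR helix parameters; everything else is an assumption made for illustration: the function name `helix_scores`, the reading of column i+j as the contribution of the residue at position i+j to the helix score of position i, and a contribution of 0 for unknown residues such as X. A full GOR prediction would compare this helix score with the corresponding scores for the other states, whose tables are not given here.

```python
WINDOW = 8  # the table spans positions i-8 .. i+8

GOR_HELIX = {
    "G": [-5, -10, -15, -20, -30, -40, -50, -60, -86, -60, -50, -40, -30, -20, -15, -10, -5],
    "A": [5, 10, 15, 20, 30, 40, 50, 60, 65, 60, 50, 40, 30, 20, 15, 10, 5],
    "V": [0, 0, 0, 0, 0, 0, 5, 10, 14, 10, 5, 0, 0, 0, 0, 0, 0],
    "L": [0, 5, 10, 15, 20, 25, 28, 30, 32, 30, 28, 25, 20, 15, 10, 5, 0],
    "I": [5, 10, 15, 20, 25, 20, 15, 10, 6, 0, -10, -15, -20, -25, -20, -10, -5],
    "S": [0, -5, -10, -15, -20, -25, -30, -35, -39, -35, -30, -25, -20, -15, -10, -5, 0],
    "T": [0, 0, 0, -5, -10, -15, -20, -25, -26, -25, -20, -15, -10, -5, 0, 0, 0],
    "D": [0, -5, -10, -15, -20, -15, -10, 0, 5, 10, 15, 20, 20, 20, 15, 10, 5],
    "E": [0, 0, 0, 0, 10, 20, 60, 70, 78, 78, 78, 78, 78, 70, 60, 40, 20],
    "N": [0, 0, 0, 0, -10, -20, -30, -40, -51, -40, -30, -20, -10, 0, 0, 0, 0],
    "Q": [0, 0, 0, 0, 5, 10, 20, 20, 10, -10, -20, -20, -10, -5, 0, 0, 0],
    "K": [20, 40, 50, 55, 60, 60, 50, 30, 23, 10, 5, 0, 0, 0, 0, 0, 0],
    "H": [10, 20, 30, 40, 50, 50, 50, 30, 12, -20, -10, 0, 0, 0, 0, 0, 0],
    "R": [0, 0, 0, 0, 0, 0, 0, 0, -9, -15, -20, -30, -40, -50, -50, -30, -10],
    "F": [0, 0, 0, 0, 0, 5, 10, 15, 16, 15, 10, 5, 0, 0, 0, 0, 0],
    "Y": [-5, -10, -15, -20, -25, -30, -35, -40, -45, -40, -35, -30, -25, -20, -15, -10, -5],
    "W": [-10, -20, -40, -50, -50, -10, 0, 10, 12, 10, 0, -10, -50, -50, -40, -20, -10],
    "C": [0, 0, 0, 0, 0, 0, -5, -10, -13, -10, -5, 0, 0, 0, 0, 0, 0],
    "M": [10, 20, 25, 30, 35, 40, 45, 50, 53, 50, 45, 40, 35, 30, 25, 20, 10],
    "P": [-10, -20, -40, -60, -80, -100, -120, -140, -77, -60, -30, -20, -10, 0, 0, 0, 0],
}
NO_INFORMATION = [0] * (2 * WINDOW + 1)  # assumption: unknown residues contribute nothing


def helix_scores(seq):
    """Sum, for every position i, the helix parameters of the residues at i-8 .. i+8."""
    scores = []
    for i in range(len(seq)):
        total = 0
        for j in range(-WINDOW, WINDOW + 1):
            pos = i + j
            if 0 <= pos < len(seq):  # positions beyond the sequence ends are skipped
                total += GOR_HELIX.get(seq[pos], NO_INFORMATION)[j + WINDOW]
        scores.append(total)
    return scores


for residue, score in zip("ISGARNIERHELIXPREDICT", helix_scores("ISGARNIERHELIXPREDICT")):
    print(residue, score)
```

High positive sums point towards helix and strongly negative sums argue against it, in line with the large negative values for Gly and Pro in the table.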


GOR Prediction

[Figure: GOR prediction curves for beta sheet and helix]



Recent methods

• Neural networks = flexible statistics
• Multiple alignments = variability
• Heuristics = common sense
• Or a combination of the above

Accuracy ~ 70%


Heuristics

Conserved parts are structurally and/or functionally important

Segments with many gaps must be in loop regions



Strategy

Use as many methods as possible

Use homologous sequences

Combine predictions into consensus prediction
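Combining predictions into a consensus can be as simple as a per-residue majority vote. The sketch below assumes each method returns a string with one state letter per residue (H = helix, E = strand, C = coil); the function name `consensus`, the three input strings and the use of "?" for ties are all made up for illustration.

```python
from collections import Counter


def consensus(predictions, tie="?"):
    """Per-residue majority vote over equally long prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        # No clear winner if the two most frequent states are equally common.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie)
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Hypothetical output of three prediction methods for the same five residues.
print(consensus(["HHHEC", "HHCEC", "HECEC"]))  # HHCEC
```

Real consensus schemes often weight methods by their reliability instead of counting each vote equally.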


Why can’t it be 100% correct?

• All current 2D (secondary structure) prediction schemes are based on the observed occurrence of 2D elements in 3D structures
• Deduction of 2D elements from structures is ambiguous! DSSP, Stride, and the (human) PDB annotation do not always agree on the assigned elements


Do these residues still belong to the helix?
