Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University.
Feature boundaries
• Need to know form and function of protein sequences to understand complex biological systems
• Features and functions cannot be determined directly
– estimate feature positions from indirect laboratory experiments, e.g. hydrophobicity
• Use statistical measures of association to determine feature boundaries
Feature boundaries
• Proteins are composed of adjacent, non-overlapping features:
– helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
• GPCR proteins have a fixed feature pattern, although feature positions are only known for one member of the family, Rhodopsin (opsd_human)
[Figure, panels A and B: GPCR topology from the N terminus to the C terminus, with seven transmembrane helices (1-7), extracellular loops ec1-ec3, and cytoplasmic loops cp1-cp3. Legend: segments contributing to the extracellular (ec) domain, the transmembrane (helices) domain, and the cytoplasmic (cp) domain.]
Goal: Statistically determine feature boundaries in sequences of amino acids
S H D E G C L S S E P K P R K Q S D S S T
Association measures
S H D E G C L S S E P K P R K Q S D S S T
2.5
2.5 is a measure of the strength of the association between P and R
Association measures
S H D E G C L S S E P K P R K Q S D S S T
[Figure: an association value annotated on each pair of adjacent residues in the sequence above; the values range from 0.2 to 6.2.]
Association measures
S H D E G C L S S E P K P R K Q S D S S T
[Figure: the sequence above annotated with an association value for each pair of adjacent residues.]
Adjacent pairs with low association measures are candidates for partition points.
Association measures are used to quantify correlations between adjacent amino acids:
• Yule’s Q statistic
• Mutual information
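The two measures named above can be sketched for adjacent residue pairs as follows. This is a minimal illustration, not the method used in the talk: the slides do not say how the counts are collected or which form of mutual information is used, so counting adjacent pairs over a set of sequences and using the pointwise form of mutual information are both assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def adjacent_pair_counts(sequences):
    """Count every ordered pair of adjacent residues in the given sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def _marginals(pairs, x, y):
    """Return (total pairs, pairs starting with x, pairs ending with y)."""
    total = sum(pairs.values())
    first_x = sum(n for (left, _), n in pairs.items() if left == x)
    second_y = sum(n for (_, right), n in pairs.items() if right == y)
    return total, first_x, second_y

def yules_q(pairs, x, y):
    """Yule's Q = (ad - bc) / (ad + bc) on the 2x2 table for 'x followed by y'."""
    total, first_x, second_y = _marginals(pairs, x, y)
    a = pairs[(x, y)]          # x followed by y
    b = first_x - a            # x followed by something else
    c = second_y - a           # something else followed by y
    d = total - a - b - c      # neither
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0

def pointwise_mi(pairs, x, y):
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))), in bits (assumed variant)."""
    total, first_x, second_y = _marginals(pairs, x, y)
    a = pairs[(x, y)]
    if a == 0:
        return float("-inf")
    return log2(a * total / (first_x * second_y))

pairs = adjacent_pair_counts(["ABAB"])
q_ab = yules_q(pairs, "A", "B")
mi_ab = pointwise_mi(pairs, "A", "B")
```

In practice the counts would come from a large collection of sequences rather than a toy string. Yule's Q is bounded between -1 and 1, while pointwise mutual information is unbounded.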
[Figure: residue-level diagram of rhodopsin showing the seven transmembrane helices (I-VII), the cytoplasmic loops cp1-cp3, and the cytoplasmic (cp), transmembrane (helices), and extracellular (ec) domains, with residue positions labeled at the helix ends.]
MI: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
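To make the agreement between the two lists concrete, the break positions can be compared pairwise. The snippet below is only illustrative arithmetic on the numbers listed above. Pairing the breaks in order, first with first and so on, is an assumption, and the variable names are hypothetical.

```python
# Break positions copied from the two lists above.
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]

# Signed offset (MI minus hydropathy) for each pair, assuming in-order pairing.
offsets = [m - h for m, h in zip(mi_breaks, hydropathy_breaks)]

# Summary of how far apart the two sets of breaks are, in residues.
max_abs_offset = max(abs(o) for o in offsets)
mean_abs_offset = sum(abs(o) for o in offsets) / len(offsets)

print(offsets)
print(max_abs_offset, round(mean_abs_offset, 2))
```

Under this pairing the two sets of breaks differ by at most 8 residues, and by about 2.6 residues on average.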
[Figure: two copies of the rhodopsin domain diagram (cytoplasmic, transmembrane, and extracellular domains). One is marked with the hydropathy breaks 61, 74, 133, 153, 230, 253, 309; the other with the MI breaks 63, 76, 136, 155, 233, 255, 301.]
• The changes in association measure values correspond to feature boundaries
• Goal: automatically detect partition points based on association measures
Partitioning algorithm
• Cluster adjacent association values
– each group is represented by its mean value
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
– deviation from mean
– [change between adjacent values]
Parameters
• Cluster adjacent association values
– each group is represented by its mean value
– parameter: window size for computing the mean
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
– deviation from mean
– parameter: cutoff distance from the mean for a value to be considered “extreme”
– [change between adjacent values]
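The steps and the two parameters can be sketched as below. This is one possible reading of the slides, not the author's implementation. Non-overlapping windows, a cutoff expressed in units of standard deviation, a two-sided test, and the population standard deviation are all assumptions, and the bracketed "change between adjacent values" criterion is left out. The function name `partition_points` is hypothetical.

```python
from statistics import mean, pstdev

def partition_points(values, window=3, cutoff=1.0):
    """Return start indices of windows whose mean is 'extreme'.

    values : association values for consecutive adjacent pairs
    window : number of adjacent values averaged into one cluster (assumed non-overlapping)
    cutoff : distance from the overall mean, in standard deviations (assumed unit)
    """
    # Step 1: cluster adjacent values; each cluster is represented by its mean.
    clusters = [
        (start, mean(values[start:start + window]))
        for start in range(0, len(values), window)
    ]
    cluster_means = [m for _, m in clusters]

    # Step 2: spread of the cluster means over all clusters.
    center = mean(cluster_means)
    spread = pstdev(cluster_means)
    if spread == 0:
        return []  # all clusters identical: nothing stands out

    # Step 3: flag clusters that lie farther than the cutoff from the mean.
    return [start for start, m in clusters if abs(m - center) > cutoff * spread]

example = [5, 5, 5, 5, 5, 5, 0, 0, 0, 5, 5, 5, 5, 5, 5]
breaks = partition_points(example, window=3, cutoff=1.0)
```

A larger window smooths out isolated low values, and a larger cutoff reports fewer partition points. To report only drops in association, as the earlier slide on low-valued pairs suggests, the test could be restricted to clusters below the mean.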
GPCR: different subfamilies
• Class A Rhodopsin like
– Amine
– Peptide
– Hormone protein
– Rhodopsin
   – Rhodopsin Vertebrate
      – Rhodopsin Vertebrate type 1
      – Rhodopsin Vertebrate type 2
      – Rhodopsin Vertebrate type 3
      – Rhodopsin Vertebrate type 4
      – Rhodopsin Vertebrate type 5
   – Rhodopsin Arthropod
   – Rhodopsin Mollusc
   – Rhodopsin Other
– Olfactory
– Prostanoid
– Nucleotide-like
– Cannabis
– Platelet activating factor
– Gonadotropin-releasing hormone
– Thyrotropin-releasing hormone & Secretagogue
– Melatonin
– Viral
– Lysosphingolipid & LPA (EDG)
– Leukotriene B4 receptor
– Class A Orphan/other
• Class B Secretin like
• Class C Metabotropic glutamate / pheromone
• Class D Fungal pheromone
• Class E cAMP receptors (Dictyostelium)
• Frizzled/Smoothened family
GPCR: different subfamilies
Size      Hierarchy
717755    GPCR
371134      Class A
48393         Rhodopsin
33543           Vertebrate
20314             Vertebrate 1
348                 opsd_human
39724       Class B
20930       Class C