Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids

Post on 15-Mar-2016



Y. Wang, O. Zaiane, R. Goebel

2

Introduction
Protein: linear sequence of amino acids
Protein subcellular localization
  Plant: nuclear, cytoplasmic, mitochondrial, extracellular, …
Intracellular vs. extracellular
  Sequence information alone
  Class imbalance
  Transparency

3

Related Work
N-terminal sorting signals
Amino acid composition
Lexical analysis
Integrative approach
Subsequence methods

4

Predicting Extracellular Proteins

Feature Extraction
Support Vector Machine
Boosting
Frequent Pattern Method

5

Feature Extraction
Frequent subsequences: subsequences that occur in more than a certain percentage of extracellular proteins
  Strong discriminative power
  Perform similar functions via related biochemical mechanisms
  Capture local similarity
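The definition of a frequent subsequence can be illustrated with a minimal sketch. It uses brute-force enumeration of contiguous substrings rather than the generalized suffix tree named on the next slide, it treats the support threshold as inclusive, and the names `frequent_subsequences`, `min_sup`, `min_len` and `max_len` as well as the toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def frequent_subsequences(proteins, min_sup, min_len=3, max_len=5):
    """Return {substring: support} for contiguous substrings whose support
    (fraction of proteins containing them) is at least min_sup."""
    counts = Counter()
    for seq in proteins:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        # Count each substring at most once per protein.
        counts.update(seen)
    n = len(proteins)
    return {s: c / n for s, c in counts.items() if c >= min_sup * n}


# Toy "extracellular" set (hypothetical sequences, not real proteins).
toy = ["MKLLA", "AMKLL", "GGGMK", "TTTTT"]
frequent = frequent_subsequences(toy, min_sup=0.5, min_len=2, max_len=3)
```

On this toy set, only the substrings shared by at least two of the four sequences survive the 50% threshold. A suffix tree would produce the same set without enumerating every substring explicitly.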

6

Generalized Suffix Tree

7

Support Vector Machine
Input data represented as feature vectors
Find a linear separator that separates the data and maximizes the margin
Kernel function: nonlinear separator

8

SVM for extracellular protein prediction

Data Transformation (sequence → vector)
  Frequent subsequences as features
  Transform protein sequences into binary vectors
Kernel Functions
  Linear kernel
  Polynomial kernel
  RBF kernel
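The transformation and the three kernels can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the feature list, the function names, and the kernel parameters `degree`, `coef0` and `gamma` are hypothetical, and the kernels follow their common textbook forms.

```python
import math


def to_binary_vector(sequence, features):
    """1 if the frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature set and toy sequences.
features = ["MK", "LL", "GG"]
x = to_binary_vector("AMKLL", features)
y = to_binary_vector("GGGMK", features)
```

With binary vectors, the linear kernel simply counts the frequent subsequences two proteins share, which is one way to read the "capture local similarity" motivation.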

9

Boosting
Iterative algorithms to improve weak classifiers
Different weighted distribution of examples in each iteration
Increase the weights of incorrectly classified examples, and decrease the weights of correctly classified ones
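The reweighting step can be made concrete with one round of the standard discrete AdaBoost update. The sketch assumes labels in {-1, +1} and a weak classifier with weighted error strictly between 0 and 0.5. The function name and the toy labels are hypothetical, and the slides do not confirm that this exact variant was used.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: return (new_weights, alpha)."""
    # Weighted error of the weak classifier on the current distribution.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0.0 < error < 0.5:
        raise ValueError("weak classifier must have 0 < error < 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified examples are scaled up, correct ones scaled down.
    updated = [
        w * math.exp(alpha if y != h else -alpha)
        for w, y, h in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # last example is misclassified
new_weights, alpha = adaboost_reweight(weights, labels, predictions)
```

After normalization the single misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.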

10

AdaBoost

11

Frequent Pattern Method
Frequent pattern: *X1*X2*…*Xn* → extracellular
  X1, X2, …, Xn are frequent subsequences
  “*” can be replaced by between zero and MaxGap amino acids when matching a protein sequence
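Matching a pattern of this form against a sequence can be sketched with a regular expression. The sketch reads the slide literally, so the leading and trailing "*" are also limited to MaxGap residues; that reading, the function names and the toy sequences are assumptions.

```python
import re


def pattern_to_regex(subsequences, max_gap):
    """Compile *X1*X2*...*Xn* with each '*' standing for 0..max_gap residues."""
    gap = ".{0,%d}" % max_gap
    body = gap.join(re.escape(s) for s in subsequences)
    return re.compile(gap + body + gap, re.DOTALL)


def matches(subsequences, max_gap, protein):
    """True if the whole protein sequence fits the gapped pattern."""
    return pattern_to_regex(subsequences, max_gap).fullmatch(protein) is not None


# Hypothetical pattern *MK*LL* with MaxGap = 3.
hit = matches(["MK", "LL"], 3, "AMKXXLLA")
too_wide = matches(["MK", "LL"], 3, "AMKXXXXLLA")
wrong_order = matches(["LL", "MK"], 3, "AMKXXLLA")
```

The order of the subsequences matters, and a gap wider than MaxGap anywhere in the pattern rejects the sequence.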

12

FOIL algorithm

13

Z-number

Defined in terms of:
  accuracy of rule R
  support of rule R

14

15

Experiments
Dataset (PASub project at UofA)
  Plant: 3293 proteins, 171 extracellular
Five-fold cross-validation

16

Evaluation Metric
Overall accuracy is not good enough
F-measure
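A short calculation shows why overall accuracy is misleading on this dataset. It uses the class counts from the Experiments slide (3293 plant proteins, 171 extracellular) and a trivial classifier that labels every protein intracellular. The balanced F1 form of the F-measure and the function names are assumptions; the slides do not state which weighting was used.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) for the positive (extracellular) class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


total, extracellular = 3293, 171

# Trivial classifier: predict "intracellular" for every protein.
tp, fp = 0, 0
fn = extracellular
tn = total - extracellular

baseline_accuracy = accuracy(tp, fp, tn, fn)  # about 0.948
baseline_f = f_measure(tp, fp, fn)            # 0.0
```

The trivial classifier is right for roughly 95% of proteins yet finds no extracellular protein at all, which the F-measure exposes and accuracy hides.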

17

Result (SVM with subsequence)

18

Result (Boosting with subsequence)

19

Result (Frequent Pattern)

Parameters: MinLen=3, Min_gain=0.1, MinSup=5%, MinConf=80%, MaxGap=300

20

Result (SVM with composition)

21

Result (Boosting with composition)

22

Cross Comparison

23

SVM with combined features

24

Boosting with combined features

25

Effects of MinLen on SVM

26

Effects of MinLen on boosting

27

Conclusion
Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
SVM achieves the best result
FSP method provides easily interpretable rules

28

Future Work
Use more information about proteins (e.g., structure, function, …)
Integrate amino acid composition into the FSP method
Incorporate more biological knowledge