Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids


Y. Wang, O. Zaiane, R. Goebel

Introduction

Protein: linear sequence of amino acids
Protein subcellular localization
  Plant: nuclear, cytoplasmic, mitochondrial, extracellular, …
Intracellular vs. extracellular
  Sequence information alone
  Class imbalance
  Transparency

Related Work

N-terminal sorting signals
Amino acid composition
Lexical analysis
Integrative approach
Subsequence methods

Predicting Extracellular Proteins

Feature Extraction
Support Vector Machine
Boosting
Frequent Pattern Method

Feature Extraction

Frequent subsequences: subsequences that occur in more than a certain percentage of extracellular proteins
  Strong discriminative power
  Perform similar functions via related biochemical mechanisms
  Capture local similarity

Generalized Suffix Tree
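
The frequent-subsequence definition above can be sketched with a brute-force count. This is not a generalized suffix tree; it is a simpler substitute that returns the same kind of result on small inputs. It assumes that "subsequence" means a contiguous run of amino acids. The function name, the `max_len` bound, and the toy sequences are hypothetical.

```python
from collections import Counter


def frequent_subsequences(sequences, min_sup, min_len=3, max_len=6):
    """Return contiguous subsequences found in more than min_sup of the sequences.

    Brute-force illustration only; a generalized suffix tree would do this
    far more efficiently on real data.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each subsequence at most once per protein
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)
    threshold = min_sup * len(sequences)
    return {sub: n / len(sequences) for sub, n in counts.items() if n > threshold}


# Toy (made-up) sequences, not taken from the dataset in the slides.
toy = ["MKTLLA", "AMKTLL", "GGGMKT", "PPPPPP"]
frequent = frequent_subsequences(toy, min_sup=0.5, min_len=3, max_len=3)
```

Here only `MKT` occurs in more than half of the toy sequences, so it is the only feature kept.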

Support Vector Machine

Input data represented as feature vectors
Find a linear separator that separates the data and maximizes the margin
Kernel function: nonlinear separator

SVM for extracellular protein prediction

Data transformation (sequence → vector)
  Frequent subsequences as features
  Transform protein sequences into binary vectors
Kernel functions
  Linear kernel
  Polynomial kernel
  RBF kernel
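
A minimal sketch of the transformation and the three kernels, assuming a feature is "on" when the subsequence occurs anywhere in the protein. The kernel parameters (`degree`, `coef0`, `gamma`) and the toy feature list are illustrative assumptions, not values from the slides.

```python
import math


def to_binary_vector(sequence, features):
    """1 if the frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # degree and coef0 are assumed defaults, chosen for illustration
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    # gamma is an assumed default, chosen for illustration
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical features and sequences.
features = ["MKT", "LLA", "GGG"]
v1 = to_binary_vector("MKTLLA", features)  # [1, 1, 0]
v2 = to_binary_vector("GGGMKT", features)  # [1, 0, 1]
```

In practice these vectors would be handed to an SVM library; the kernels are written out only to show what each one computes on binary inputs.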

Boosting

Iterative algorithm to improve a weak classifier
Different weighted distribution of examples in each iteration
Increase the weights of incorrectly classified examples, and decrease the weights of correctly classified ones

AdaBoost
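
The reweighting step can be sketched as one round of the standard discrete AdaBoost update. This is the textbook form, and the slides do not say which variant was used; the function name and toy inputs are hypothetical.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.

    weights: current example weights, summing to 1
    correct: for each example, True if the weak classifier got it right
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)  # vote of this weak classifier
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalize to a distribution


# Four equally weighted toy examples, one of them misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
```

After this round the misclassified example carries half of the total weight and each correctly classified one carries a sixth, so the next weak classifier is pushed toward the hard example.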

Frequent Pattern Method

Frequent pattern: *X1*X2*…*Xn* → extracellular
  X1, X2, …, Xn are frequent subsequences
  "*" can be substituted by zero or up to MaxGap amino acids when matching a protein sequence
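
A minimal sketch of matching such a pattern against a protein, reading the definition literally: every "*", including the leading and trailing ones, stands for zero to MaxGap residues. Whether the outer gaps were bounded in the original method is an assumption here. The function name and toy sequences are hypothetical.

```python
import re


def matches_pattern(sequence, subsequences, max_gap):
    """True if sequence has the form *X1*X2*...*Xn* with each * at most max_gap long."""
    gap = ".{0,%d}" % max_gap
    # Escape each subsequence so it is matched as literal text.
    regex = gap + gap.join(re.escape(s) for s in subsequences) + gap
    return re.fullmatch(regex, sequence) is not None


# Toy checks with MaxGap = 3.
close_enough = matches_pattern("AMKTTLLG", ["MK", "LL"], max_gap=3)   # gap of 2 between MK and LL
too_far = matches_pattern("AMKTTTTLLG", ["MK", "LL"], max_gap=3)      # gap of 4 between MK and LL
```

A rule then predicts "extracellular" for every protein its pattern matches. With a large MaxGap the outer bounds rarely matter, since they only exclude sequences with long flanking regions.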

FOIL algorithm

Z-number

Defined in terms of the accuracy of rule R and the support of rule R

Experiments

Dataset (PASub project at UofA)
  Plant: 3293 proteins, 171 extracellular
Five-fold cross-validation
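
A minimal sketch of how the 3293 proteins could be split for five-fold cross-validation. The slides do not say whether the folds were stratified or how they were shuffled, so the plain shuffled split, the seed, and the function name below are assumptions.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(3293, k=5))
```

Each protein lands in exactly one test fold. With only 171 positives, a stratified split that keeps the class ratio in every fold may be the safer choice.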

Evaluation Metrics

Overall accuracy is not sufficient (class imbalance)
F-measure
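
A short worked example of why accuracy misleads on this dataset, using the class counts from the Experiments slide. The F-measure is the usual weighted harmonic mean of precision and recall; `beta=1` is an assumed default, and the "predict everything as intracellular" classifier is a hypothetical baseline.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Baseline that labels all 3293 plant proteins as intracellular.
total, extracellular = 3293, 171
baseline_accuracy = (total - extracellular) / total        # about 0.948
baseline_f = f_measure(tp=0, fp=0, fn=extracellular)       # 0.0
```

The baseline reaches about 94.8% accuracy without finding a single extracellular protein, while its F-measure on the extracellular class is 0.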

Result(SVM with subsequence)

Result(Boosting with subsequence)

Result(Frequent Pattern)

Parameters: MinLen=3, Min_gain=0.1, MinSup=5%, MinConf=80%, MaxGap=300

Result(SVM with composition)

Result(Boosting with composition)

Cross Comparison

SVM with combined features

Boosting with combined features

Effects of MinLen on SVM

Effects of MinLen on boosting

Conclusion

Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
SVM achieves the best result
FSP method provides easily interpretable rules

Future Work

Use more information about proteins (e.g., structure, function, …)
Integrate amino acid composition into the FSP method
Incorporate more biological knowledge
