
Artificial Intelligence Research Laboratory, Department of Computer Science. RECOMB 2009

Acknowledgements: This work is supported in part by a grant from the National Science Foundation (NSF 0711356) to Vasant Honavar.

Combining Abstraction and Super-structuring on Macromolecular Sequence Classification
Adrian Silvescu, Cornelia Caragea, and Vasant Honavar

Introduction: The choice of features that are used to describe the data presented to a learner, and the level of detail at which they describe the data, can have a major impact on the difficulty of learning and on the accuracy, complexity, and comprehensibility of the learned predictive model. The representation has to be rich enough to capture the distinctions that are relevant from the standpoint of learning, but not so rich as to make the task of learning infeasible.

Results:

[Figure: four panels comparing methods on Eukaryotes 3-grams, Prokaryotes 3-grams, Eukaryotes 2-grams, and Prokaryotes 2-grams.]
Comparison of super-structuring and abstraction (SS+ABS) with super-structuring and feature selection (SS+FSEL), super-structuring only (SS_ONLY), and unigram (UNIGRAM) on the Eukaryotes and Prokaryotes data sets.

[Figure: two panels, (a) 10 Abstractions and (b) 1000 Abstractions.]
Class distributions induced by one of the m abstractions, and the class distributions induced by three 3-grams sampled from that abstraction, on the Eukaryotes 3-gram data set, where (a) m=10 and (b) m=1000. The number of classes is 4.

Problem: Predict the subcellular localization of a protein from its sequence.

Previous Approaches to Feature Construction:


Super-structuring: generating k-grams

Example sequence: SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA

3-grams (overlapping sliding window): SIN, INQ, NQK, QKL, KLA, LAL, ALV, LVI, ...
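The sliding-window step above is straightforward to express in code. The following is a minimal sketch, assuming contiguous, overlapping k-grams over the raw amino-acid string; the function name extract_kgrams is ours, not from the poster.

```python
from collections import Counter

def extract_kgrams(sequence: str, k: int = 3) -> Counter:
    """Super-structuring: slide a window of width k over the sequence and count each k-gram."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Usage on a prefix of the example sequence above:
print(extract_kgrams("SINQKLALVI", k=3))
# Counter({'SIN': 1, 'INQ': 1, 'NQK': 1, 'QKL': 1, 'KLA': 1, 'LAL': 1, 'ALV': 1, 'LVI': 1})
```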

Abstraction: grouping similar features to generate more abstract features


Our Approach:

Combining super-structuring and abstraction to construct new features
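To make the combination concrete, here is a small sketch of how a sequence is mapped to an m-dimensional feature vector once a k-gram-to-abstraction mapping has been learned (see the agglomerative procedure below). The function sequence_to_abstract_features and the toy mapping are illustrative assumptions, not the authors' code.

```python
def sequence_to_abstract_features(sequence: str, k: int, abstraction_of: dict, m: int) -> list:
    """Super-structure the sequence into k-grams, then pool the k-gram counts
    into m bins according to the k-gram -> abstraction mapping."""
    features = [0] * m
    for i in range(len(sequence) - k + 1):
        kgram = sequence[i:i + k]
        if kgram in abstraction_of:      # k-grams unseen during training are simply skipped
            features[abstraction_of[kgram]] += 1
    return features

# Toy usage with m=2 abstractions; the mapping below is made up for illustration.
toy_map = {"SIN": 0, "INQ": 0, "NQK": 1, "QKL": 1, "KLA": 1}
print(sequence_to_abstract_features("SINQKLA", k=3, abstraction_of=toy_map, m=2))  # [2, 3]
```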

Distance between Abstractions: Let $w_1, w_2 \in [0, \infty)$, let $X$ be a finite set, and let $p_1, p_2$ be two probability distributions over $X$. Let

$\pi_1 = \frac{w_1}{w_1 + w_2}, \qquad \pi_2 = \frac{w_2}{w_1 + w_2}, \qquad p(\cdot) = \pi_1 \, p_1(\cdot) + \pi_2 \, p_2(\cdot).$

The weighted Jensen-Shannon divergence is given by:

$\mathrm{WJS}\big([p_1(\cdot), w_1], [p_2(\cdot), w_2]\big) = \pi_1 \, \mathrm{KL}\big(p_1(\cdot) \,\|\, p(\cdot)\big) + \pi_2 \, \mathrm{KL}\big(p_2(\cdot) \,\|\, p(\cdot)\big).$

Then, the distance between two abstractions is defined as follows:

$\mathrm{dist}(a_i, a_j) = \big(p(a_i) + p(a_j)\big) \, \mathrm{WJS}\big([p(Y \mid a_i), \#a_i], [p(Y \mid a_j), \#a_j]\big),$

where $Y$ is the class variable.

[Figure: abstraction hierarchy over the 3-grams SIN, INQ, NQK, QKL, KLA, ...; the leaf abstractions a1-a6 are recursively merged into internal abstractions a7-a11.]
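A direct transcription of the two formulas above into code (a sketch, assuming class-conditional distributions are given as lists over the same class set; kl, wjs, and abstraction_distance are our helper names):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def wjs(p1, w1, p2, w2):
    """Weighted Jensen-Shannon divergence WJS([p1, w1], [p2, w2])."""
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    p = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    return pi1 * kl(p1, p) + pi2 * kl(p2, p)

def abstraction_distance(p_ai, p_aj, class_dist_ai, class_dist_aj, count_ai, count_aj):
    """dist(a_i, a_j) = (p(a_i) + p(a_j)) * WJS([p(Y|a_i), #a_i], [p(Y|a_j), #a_j])."""
    return (p_ai + p_aj) * wjs(class_dist_ai, count_ai, class_dist_aj, count_aj)

# Toy usage with 4 classes (as in the Eukaryotes data set); the numbers are made up.
d = abstraction_distance(0.02, 0.03,
                         [0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1],
                         count_ai=200, count_aj=300)
print(round(d, 6))
```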

Constructing Abstractions over k-grams: a greedy agglomerative procedure:
- initially, map each k-gram to its own abstraction;
- recursively merge the closest pair of abstractions until m abstractions are obtained, e.g., m=2.
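A compact sketch of the greedy agglomerative loop, assuming each abstraction is represented by its prior p(a), its class distribution p(Y|a), and its count #a, and that merging pools priors and counts and mixes class distributions proportionally; the naive quadratic search over pairs and all helper names are our choices:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared finite support."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def dist(a, b):
    """(p(a)+p(b)) * WJS of the class distributions, with the counts as weights (see the formulas above)."""
    w1, w2 = a["count"], b["count"]
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    mix = [pi1 * x + pi2 * y for x, y in zip(a["class_dist"], b["class_dist"])]
    return (a["prior"] + b["prior"]) * (pi1 * kl(a["class_dist"], mix) + pi2 * kl(b["class_dist"], mix))

def merge(a, b):
    """Merging pools the k-grams, priors, and counts, and mixes the class distributions."""
    w1, w2 = a["count"], b["count"]
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    return {"kgrams": a["kgrams"] + b["kgrams"],
            "prior": a["prior"] + b["prior"],
            "count": w1 + w2,
            "class_dist": [pi1 * x + pi2 * y for x, y in zip(a["class_dist"], b["class_dist"])]}

def agglomerate(abstractions, m):
    """Greedily merge the closest pair of abstractions until only m remain."""
    abstractions = list(abstractions)
    while len(abstractions) > m:
        i, j = min(((i, j) for i in range(len(abstractions)) for j in range(i + 1, len(abstractions))),
                   key=lambda ij: dist(abstractions[ij[0]], abstractions[ij[1]]))
        merged = merge(abstractions[i], abstractions[j])
        abstractions = [a for t, a in enumerate(abstractions) if t not in (i, j)] + [merged]
    return abstractions

# Toy usage: three 3-gram "leaves" (4 classes) merged down to m=2 abstractions; the numbers are made up.
leaves = [
    {"kgrams": ["SIN"], "prior": 0.02, "count": 20, "class_dist": [0.7, 0.1, 0.1, 0.1]},
    {"kgrams": ["INQ"], "prior": 0.03, "count": 30, "class_dist": [0.6, 0.2, 0.1, 0.1]},
    {"kgrams": ["NQK"], "prior": 0.05, "count": 50, "class_dist": [0.1, 0.1, 0.1, 0.7]},
]
print([a["kgrams"] for a in agglomerate(leaves, m=2)])  # SIN and INQ end up in the same abstraction
```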

Feature selection: an alternative approach to reducing the number of k-grams to m; we used the mutual information between the class variable and the k-grams to rank the k-grams and kept the top m.
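A sketch of the mutual-information ranking, assuming each k-gram is scored by the empirical mutual information between its presence/absence in a sequence and the sequence's class label; the binarization choice and the function names are our assumptions, not necessarily the authors' exact protocol:

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Empirical mutual information I(X; Y) between a discrete feature and the class labels."""
    n = len(labels)
    px = Counter(feature_values)
    py = Counter(labels)
    pxy = Counter(zip(feature_values, labels))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def top_m_kgrams(sequences, labels, k, m):
    """Rank all k-grams by mutual information with the class and keep the top m."""
    vocab = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    scores = {}
    for g in vocab:
        presence = [int(g in s) for s in sequences]   # binary occurrence indicator per sequence
        scores[g] = mutual_information(presence, labels)
    return sorted(vocab, key=scores.get, reverse=True)[:m]

# Toy usage: prints the two k-grams that best separate the two (made-up) classes.
seqs = ["SINQKLA", "SINQAAA", "KKKNQKQ", "QKLKKKK"]
print(top_m_kgrams(seqs, labels=["cytoplasm", "cytoplasm", "nucleus", "nucleus"], k=3, m=2))
```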

Data sets:
- Eukaryotes contains 2,427 protein sequences, each classified into one of four classes.
- Prokaryotes contains 997 protein sequences, each classified into one of three classes.

Conclusions: We have shown that:
- combining super-structuring and abstraction makes it possible to construct predictive models that use a significantly smaller number of features than those obtained using super-structuring alone;
- abstraction in combination with super-structuring yields better-performing models than those obtained by feature selection in combination with super-structuring.