Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J....
-
Upload
aileen-wheeler -
Category
Documents
-
view
226 -
download
2
Transcript of Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J....
![Page 1: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/1.jpg)
Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays
J. Tobler, M. Molla, J. ShavlikUniversity of Wisconsin-Madison M. Molla, E. Nuwaysir, R. GreenNimblegen Systems Inc.
![Page 2: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/2.jpg)
probes
surface
Oligonucleotide Microarrays
Specific probes synthesized atknown spot on chip’s surface
Probes complementary to RNA of genes to be measured
Typical gene (1kb+) MUCH longer than typical probe (24 bases)
![Page 3: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/3.jpg)
Probes: Good vs. Bad
good probe
bad probe
Blue = ProbeRed = Sample
![Page 4: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/4.jpg)
Probe-Picking Method Needed
Hybridization characteristics differ between probes
Probe set represents very small subset of gene
Accurate measurement of expression requires good probe set
![Page 5: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/5.jpg)
Related Work
Use known hybridization characteristics
Lockhardt et al. 1996
Melting point (Tm) predictionsKurata and Suyama 1999
Li and Stormo 2001
Stable secondary structureKurata and Suyama 1999
![Page 6: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/6.jpg)
Our Approach
Apply established machine-learning algorithms Train on categorized examples Test on examples with category hidden
Choose features to represent probes
Categorize probes as good or bad
![Page 7: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/7.jpg)
The FeaturesFeature Name Description
fracA, fracC, fracG, fracT The fraction of A, C, G, or T in the 24-mer
fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT,fracTA, fracTC, fracTG, fracTT
The fraction of each of these dimers in the 24-mer
n1, n2, …., n24 The particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
d1, d2, …, d23 The particular dimer (AA, AC,…TT) at the specified position in the 24-mer
![Page 8: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/8.jpg)
The Data
Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…Complement: CATCGATCGTAATCGTACCGGTCAGTAC…
Probe 1: CATCGATCGTAATCGTACCGGTCA
Probe 2: ATCGATCGTAATCGTACCGGTCAG
Probe 3: TCGATCGTAATCGTACCGGTCAGT
… …
Tilings of 8 genes (from E. coli & B. subtilus) Every possible probe (~10,000 probes) Genes known to be expressed in sample
![Page 9: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/9.jpg)
Our Microarray
![Page 10: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/10.jpg)
0 99
Defining our Categories
Normalized Probe Intensity
Low Intensity = BAD Probes
(45%)
High Intensity = GOOD
Probes (32%)
Mid-Intensity = Not Used in Training Set
(23%)
Frequenc
y
0 .05 .15 1.0
![Page 11: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/11.jpg)
The Machine Learning Techniques
Naïve Bayes (Mitchell 1997)
Neural Networks (Rumelhart et al. 1995)
Decision Trees (Quinlan 1996)
Can interpret predictions of each learner probabilistically
![Page 12: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/12.jpg)
Naïve Bayes
Assumes conditional independence between features
Make judgments about test set examples based on conditional probability estimates made on training set
![Page 13: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/13.jpg)
Naïve Bayes
For each example in the test set, evaluate the following:
ilowivalueifeaturePlowP
ihighivalueifeaturePhighP
)|()(
)|()(
![Page 14: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/14.jpg)
Neural Network(1-of-n encoding with probe length = 3)
Example probe
sequence: “CAG”
Weights
ACTIVATI
O
NERROR
Good or Bad…
…
A2
C2
G2
T2
A3
C3
G3
T3
A1
C1
G1
T1
![Page 15: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/15.jpg)
Decision Tree
n14
fracAC
fracT
fracTC
Bad Probe … … …
Good Probe
…
…
Automatically builds a tree of rules
High
…
…Low
High
Low High
…
Low High
Low
High
C G TA
fracC
fracG
fracG
Low
Low
High
![Page 16: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/16.jpg)
Decision Tree
The information gain of a feature, F, is:
)(||
||)(
),(
)(v
FValuesv
v SEntropyS
SSEntropy
FSnGainInformatio
![Page 17: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/17.jpg)
Information Gain per Feature
CG
CC
C
A G
T
AA
AC
AG
ATCA
CTGA GG
TC
GC TAGT TT
TG0.0
1.0
22 2324 1 2 3 4 5 6 789 10
11 1213 1415 16 1718 19 20
21 22232119 2017181614 151311 129 108764 51 2 3
0.0
1.0
Probe Composition Features
Norm
aliz
ed
In
form
ati
on
Gain
Base Position Features
Base Position
Dimer Position
![Page 18: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/18.jpg)
Cross-Validation
Leave-one-out testing: For each gene (of the 8)
Train on all but this geneTest on this geneRecord resultForget what was learned
Average results across 8 test genes
![Page 19: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/19.jpg)
Typical Probe-Intensity Prediction Across Short Region
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
650 655 660 665 670 675 680 685 690 695 700
Actual
Norm
aliz
ed
Pro
be In
ten
sity
Starting Nucleotide Position for 24-mer Probe
![Page 20: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/20.jpg)
Typical Probe-Intensity Prediction Across Short Region
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
650 655 660 665 670 675 680 685 690 695 700
Naïve Bayes Decisio
n Tree
Neural Network
Actual
Norm
aliz
ed
Pro
be In
ten
sity
Starting Nucleotide Position for 24-mer Probe
![Page 21: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/21.jpg)
Probe-Picking Results
0
2
4
6
8
10
12
14
16
18
20
0 2 4 6 8 10 12 14 16 18 20
Nu
mb
er
of
pro
bes
sele
cted
wit
h
inte
nsi
ty >
= 9
0th p
erc
enti
le
Number of probes selected
Perfect Selector
![Page 22: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/22.jpg)
Probe-Picking Results
0
2
4
6
8
10
12
14
16
18
20
0 2 4 6 8 10 12 14 16 18 20
Nu
mb
er
of
pro
bes
sele
cted
wit
h
inte
nsi
ty >
= 9
0th p
erc
enti
le
Number of probes selected
Naïve Bayes
Neural Network
Decision Tree
Primer Melting Point
Perfect Selector
![Page 23: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/23.jpg)
Current and Future Directions
Consider more features Folding patterns Melting point
Feature selection
Evaluate specificity along with sensitivity Ie, consider false positives
Evaluate probe selection + gene calling
Try more ML techniques SVMs, ensembles, …
![Page 24: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/24.jpg)
Take-Home Message
Machine learning does a good job on this part of probe-selection problem Easy to collect large number of training
ex’s Easily measured features work well
Intelligent probe selection can increase microarray accuracy and efficiency
![Page 25: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/25.jpg)
Acknowledgements
NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array. Darryl Roy for helping in creating the training data. Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.
![Page 26: Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison.](https://reader034.fdocuments.in/reader034/viewer/2022042703/56649ee15503460f94bf20d1/html5/thumbnails/26.jpg)
Thanks