IPK Gatersleben, Pattern Recognition Group
Correlation-based Data Processing
and its Application to Biology
Marc Strickert (stricker@ipk-gatersleben.de)
Osnabrück, 14 January 2005
Pattern Recognition Group
Schloss Dagstuhl
Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben
Goals
1. Attribute rating
2. Clustering
3. Classification
4. Visualization
of biological data, exploiting properties of Pearson correlation.
Euclidean distances may be problematic
$d_1 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$
$d_2 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$
$d_1$ and $d_2$ are identical despite different shapes
[ John Lee and Michel Verleysen ]
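A small numeric sketch of this effect, using hypothetical 5-point profiles (the values are illustrative and not taken from the slide): both pairs have the same squared Euclidean distance, but only the first pair shares the same shape.

```python
import numpy as np

# Hypothetical 5-point profiles, chosen for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_shifted = x + 1.0                                    # same shape, constant offset
y_zigzag = x + np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # different shape

def squared_euclidean(a, b):
    """Sum of squared component differences."""
    return float(np.sum((a - b) ** 2))

def pearson(a, b):
    """Pearson correlation coefficient of two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

d1 = squared_euclidean(x, y_shifted)
d2 = squared_euclidean(x, y_zigzag)
r1 = pearson(x, y_shifted)
r2 = pearson(x, y_zigzag)

print("squared Euclidean distances:", d1, d2)   # identical
print("Pearson correlations:", r1, r2)          # different
```

The Euclidean distance cannot tell the two cases apart, while the correlation of the shifted pair is 1 and that of the zigzag pair is clearly lower.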
Pearson correlation is invariant to scaling (amplitude) and shifting (vertical offset)
[Figure: up-regulated gene profiles. Euclidean view: raw data. 'Pearson' view: same profiles, aligned; same correlations as above!]
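The invariance can be checked numerically. The sketch below uses made-up profiles and an arbitrary positive amplitude and offset (all values are assumptions for illustration). It also shows that aligning profiles by z-scoring leaves a squared Euclidean distance that depends only on the correlation.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

def zscore(a):
    """Align a profile: zero mean, unit (population) standard deviation."""
    return (a - a.mean()) / a.std()

# Hypothetical up-regulated profiles.
x = np.array([0.2, 0.5, 1.1, 1.9, 3.0])
y = np.array([1.0, 1.2, 2.5, 2.4, 4.1])

# Change amplitude (factor 10) and vertical offset (+3) of x.
x_rescaled = 10.0 * x + 3.0

r_raw = pearson(x, y)
r_rescaled = pearson(x_rescaled, y)

# After z-scoring, the squared Euclidean distance equals 2 * n * (1 - r).
n = x.size
d_aligned = float(np.sum((zscore(x_rescaled) - zscore(y)) ** 2))

print(r_raw, r_rescaled)
print(d_aligned, 2 * n * (1 - r_raw))
```

The invariance holds for positive amplitudes; a negative amplitude flips the sign of the correlation.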
Derivatives of squared Euclidean and Pearson correlation
Squared Euclidean:
Pearson correlation:
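As a sketch of what such derivatives look like, the code below implements the standard derivatives of the squared Euclidean distance and of the Pearson correlation with respect to one of the two vectors, and checks them against finite differences. The symbol names `x` and `w`, the helper names, and the example values are assumptions for illustration.

```python
import numpy as np

def sq_euclidean(x, w):
    return np.sum((x - w) ** 2)

def sq_euclidean_grad(x, w):
    # d/dw_k sum_i (x_i - w_i)^2 = -2 (x_k - w_k)
    return -2.0 * (x - w)

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return (xc @ wc) / np.sqrt((xc @ xc) * (wc @ wc))

def pearson_grad(x, w):
    # With B = sum (x_i - mean_x)(w_i - mean_w), C = sum (x_i - mean_x)^2,
    # D = sum (w_i - mean_w)^2 and r = B / sqrt(C D):
    # dr/dw_k = ((x_k - mean_x) - (B / D) (w_k - mean_w)) / sqrt(C D)
    xc, wc = x - x.mean(), w - w.mean()
    B, C, D = xc @ wc, xc @ xc, wc @ wc
    return (xc - (B / D) * wc) / np.sqrt(C * D)

def numeric_grad(f, x, w, h=1e-6):
    """Central finite differences with respect to w, for checking only."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (f(x, w + e) - f(x, w - e)) / (2.0 * h)
    return g

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

print(sq_euclidean_grad(x, w), numeric_grad(sq_euclidean, x, w))
print(pearson_grad(x, w), numeric_grad(pearson, x, w))
```

Unlike the Euclidean derivative, the correlation derivative of component k depends on all components through the means and the sums B, C, D.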
Applications for derivative of similarity measure
1. Attribute rating
(Variance analogue)
2. Clustering
(Neural Gas for Correlation, NG-C)
3. Classification
(GRLVQ-C)
4. Visualization
(High-Throughput MDS)
Attribute rating
Variance as double sum of derivatives of the squared Euclidean distance
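A numeric sketch of this identity for a single attribute: the population variance equals a double sum over all pairs of squared differences, and each difference is, up to the factor 2, the derivative of the squared Euclidean distance with respect to that attribute. The normalization constants and function names are assumptions chosen so that the result matches `np.var`.

```python
import numpy as np

def variance_from_pairs(column):
    """Population variance as a double sum over all pairs (i, j)."""
    n = column.size
    diffs = column[:, None] - column[None, :]
    return np.sum(diffs ** 2) / (2.0 * n * n)

def variance_from_derivatives(column):
    """Same quantity, written with derivatives of the squared Euclidean distance.

    For d(x_i, x_j) = sum_k (x_ik - x_jk)^2 the derivative with respect to
    x_ik is 2 (x_ik - x_jk); squaring it gives 4 (x_ik - x_jk)^2.
    """
    n = column.size
    derivs = 2.0 * (column[:, None] - column[None, :])
    return np.sum(derivs ** 2) / (8.0 * n * n)

rng = np.random.default_rng(0)
column = rng.normal(size=50)      # one attribute measured on 50 samples

print(np.var(column), variance_from_pairs(column), variance_from_derivatives(column))
```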
Correlation analogue of the Euclidean variance
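One plausible reading of this analogue, sketched below, replaces the derivative of the squared Euclidean distance by the derivative of the Pearson correlation and again sums the squared derivatives over all pairs of samples. This construction, the normalization, and the function names are assumptions for illustration; the slide's own formula is not reproduced here.

```python
import numpy as np

def pearson_grad_x(x, w):
    """Derivative of the Pearson correlation r(x, w) with respect to x."""
    xc, wc = x - x.mean(), w - w.mean()
    B, C, D = xc @ wc, xc @ xc, wc @ wc
    return (wc - (B / C) * xc) / np.sqrt(C * D)

def correlation_attribute_rating(X):
    """Assumed rating: mean squared correlation derivative per attribute.

    X has one sample per row and one attribute per column.
    """
    n, m = X.shape
    rating = np.zeros(m)
    for i in range(n):
        for j in range(n):
            if i != j:
                rating += pearson_grad_x(X[i], X[j]) ** 2
    return rating / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))       # 6 hypothetical samples, 5 attributes
rating = correlation_attribute_rating(X)
print(rating)
```

Attributes with a large rating are those where small changes have the strongest effect on the pairwise correlations.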
Clustering: Neural Gas (NG revisited)
NG-C:
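A minimal sketch of one Neural Gas style update with correlation as the similarity: prototypes are ranked by their correlation to the presented pattern and each is moved along the correlation gradient, weighted by an exponential function of its rank. The learning rate, neighborhood range, function names, and data are assumptions; this is not the author's implementation.

```python
import numpy as np

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return (xc @ wc) / np.sqrt((xc @ xc) * (wc @ wc))

def pearson_grad_w(x, w):
    """Derivative of r(x, w) with respect to the prototype w."""
    xc, wc = x - x.mean(), w - w.mean()
    B, C, D = xc @ wc, xc @ xc, wc @ wc
    return (xc - (B / D) * wc) / np.sqrt(C * D)

def ngc_step(x, W, lr=0.5, lam=1.0):
    """Rank prototypes by correlation to x and move each one uphill."""
    r = np.array([pearson(x, w) for w in W])
    order = np.argsort(-r)                  # best-matching prototype first
    ranks = np.empty(len(W), dtype=int)
    ranks[order] = np.arange(len(W))        # rank 0 = highest correlation
    h = np.exp(-ranks / lam)                # neighborhood weighting
    W_new = W.copy()
    for i, w in enumerate(W):
        W_new[i] = w + lr * h[i] * pearson_grad_w(x, w)
    return W_new

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # presented pattern
W = np.array([[5.0, 3.0, 4.0, 1.0, 2.0],         # prototype with r = -0.8
              [1.0, 3.0, 2.0, 5.0, 4.0]])        # prototype with r = +0.8
W_new = ngc_step(x, W)

r_before = [pearson(x, w) for w in W]
r_after = [pearson(x, w) for w in W_new]
print(r_before, r_after)
```

In a full run the step would be repeated over many patterns while `lr` and `lam` are decreased, as in standard Neural Gas.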
High centroid reproducibility with NG-C
23 gene expression centroids, 10 independent runs
k-means: indeterminate final states.
NG-C: crisp final states.
Classification with relevance learning
For example used in Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C)
Adaptive Pearson correlation:
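A sketch of a relevance-weighted correlation, assuming that each attribute i carries a relevance factor `lam[i]` that enters squared in all three sums and that the means are taken unweighted. Both choices and the example values are assumptions for illustration.

```python
import numpy as np

def adaptive_pearson(x, w, lam):
    """Pearson correlation with per-attribute relevance factors lam."""
    xc, wc = x - x.mean(), w - w.mean()
    l2 = lam ** 2
    numerator = np.sum(l2 * xc * wc)
    denominator = np.sqrt(np.sum(l2 * xc ** 2) * np.sum(l2 * wc ** 2))
    return numerator / denominator

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 3.0, 2.0, 5.0, 0.0])   # last attribute disagrees with x

uniform = np.ones(5)
downweighted = np.array([1.0, 1.0, 1.0, 1.0, 0.0])  # ignore the last attribute

r_plain = adaptive_pearson(x, w, uniform)
r_weighted = adaptive_pearson(x, w, downweighted)
print(r_plain, r_weighted)
```

With uniform relevances the measure reduces to the ordinary Pearson correlation. In relevance learning the factors would be adapted during training so that discriminative attributes gain weight.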
Leukemia cancer data set: AML / ALL separation
GRLVQ-C: Relevance factors top 10 gene ranking.
1 prototype per class + relevance learning.
consistent with Golub et al.
Visualization of high-dimensional data
High-dimensional data (constant source)
Low-dimensional points (variable target)
[Figure: “embedding” of points A, B, C (3D) with pairwise distances d12, d13, d23 as points A', B', C' (2D) with corresponding distances d12, d13, d23.]
Gradient-based stochastic optimization: HiT-MDS.
Maximize distance correlations: source ≈ reconstruction
original inter-point distance matrix
reconstructed inter-point distance matrix
Adaptive parameters: point coordinates
Minimize embedding stress function using negative Fisher's Z':
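A sketch of such a stress function: it correlates the original and the reconstructed inter-point distances and returns the negative Fisher Z' transform, which equals `-arctanh(r)`. The function names and the small example configuration are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix of the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def correlation_of_distances(X, Y):
    """Pearson correlation between original and reconstructed distances."""
    iu = np.triu_indices(X.shape[0], k=1)      # each pair once
    d = pairwise_distances(X)[iu]
    d_hat = pairwise_distances(Y)[iu]
    return float(np.corrcoef(d, d_hat)[0, 1])

def stress(X, Y):
    """Negative Fisher Z' of the distance correlation; lower is better."""
    return -np.arctanh(correlation_of_distances(X, Y))

# Hypothetical 3D source points that are almost flat in the third coordinate.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [3.0, 3.0, 0.5],
              [5.0, 1.0, 0.2]])
Y_good = X[:, :2].copy()       # drop the nearly constant coordinate
Y_bad = Y_good[::-1].copy()    # same points assigned to the wrong samples

print(stress(X, Y_good), stress(X, Y_bad))
```

Because the correlation ignores scale and offset of the distances, the embedding only needs to preserve the pattern of distances, not their absolute values.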
Iterative gradient descent for stress function minimization
Terms of the gradient: derivative of Fisher's Z'; distance derivative for Euclidean spaces
High-Throughput Multi-Dimensional Scaling (HiT-MDS)
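HiT-MDS algorithm:
Initialize $\hat{X}$ by random projection (or smarter).
Calculate correlation $r(X, \hat{X})$ once.
Draw next pattern $x_i$.
Minimize stress $s$ to all $\hat{x}_j$: $\Delta \hat{x}_i^k \sim -\partial s / \partial \hat{x}_i^k$.
Recalculate distances $\hat{d}_{ij}$; adapt …, …, and $r$.
[Diagram: input $x_i \in X$ with distances $d_{ij}$; embedding $\hat{x}_i \in \hat{X}$ with distances $\hat{d}_{ij}$; correlation $r(d_{ij}, \hat{d}_{ij})$; stress $s$.]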
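The steps above can be sketched as follows. This is a simplified variant under stated assumptions, not the author's implementation: it recomputes all embedded distances and the correlation at every step instead of adapting them incrementally, and the function names, learning rate, epoch count, and the clipping of r are choices made for illustration.

```python
import numpy as np

def pairwise_distances(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def stress(D, Y):
    """Negative Fisher Z' of the correlation between D and the distances of Y."""
    iu = np.triu_indices(D.shape[0], k=1)
    r = np.corrcoef(D[iu], pairwise_distances(Y)[iu])[0, 1]
    return -np.arctanh(r)

def stress_gradient(D, Y, i):
    """Gradient of the stress with respect to the embedded point Y[i]."""
    n = D.shape[0]
    iu = np.triu_indices(n, k=1)
    D_hat = pairwise_distances(Y)
    mu_d, mu_h = D[iu].mean(), D_hat[iu].mean()
    dc, hc = D[iu] - mu_d, D_hat[iu] - mu_h
    B, C, H = dc @ hc, dc @ dc, hc @ hc
    r = np.clip(B / np.sqrt(C * H), -0.999999, 0.999999)  # guard against |r| = 1
    ds_dr = -1.0 / (1.0 - r ** 2)            # derivative of -arctanh(r)
    j = np.arange(n) != i                    # all other points
    # derivative of r with respect to each embedded distance d_hat[i, j]
    dr_dd = ((D[i, j] - mu_d) - (B / H) * (D_hat[i, j] - mu_h)) / np.sqrt(C * H)
    # derivative of the Euclidean distance d_hat[i, j] with respect to Y[i]
    dd_dy = (Y[i] - Y[j]) / D_hat[i, j][:, None]
    return ds_dr * np.sum(dr_dd[:, None] * dd_dy, axis=0)

def hitmds_sketch(X, dim=2, epochs=20, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_distances(X)                        # constant source distances
    Y = X @ rng.normal(size=(X.shape[1], dim))       # random projection
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):        # draw next pattern
            Y[i] -= lr * stress_gradient(D, Y, i)    # step against the gradient
    return Y

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                          # hypothetical input data
embedding = hitmds_sketch(X)
print(stress(pairwise_distances(X), embedding))
```

Recomputing all distances for every update costs quadratic time per step, so a high-throughput version would instead update only the affected sums and the correlation after each move.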
Applications of dimension reduction (visualization)
1. Gene space browser.
2. Macro-experiment grouping.
[Figure: panels 1 and 2; time range from day 0 to day 26.]
Embedding 12k Genes (14 time points) in 2D
[Figure: 2D embeddings (FIT) for EUC, COR, and SRC; COR shown for "orig" and "spline" data; regions labeled U, I, D.]
EUC: Euclidean distance
COR: Pearson correlation
SRC: Spearman rank correlation
Gene browser (4824 high-quality genes)
[Figure: time axis from 0 to 26 DAF in steps of 2.]
[ visualization: www.ggobi.org ]
Gene browser for powers of correlation: $(1-r)^8$
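A sketch of how such a dissimilarity could be computed for a set of profiles. That the matrix is then used as input to the embedding is an assumption, as are the function name and the example data.

```python
import numpy as np

def correlation_dissimilarity(profiles, power=8):
    """Pairwise (1 - r)^power for profiles stored as rows."""
    R = np.corrcoef(profiles)
    return (1.0 - R) ** power

# Hypothetical profiles: rows 0 and 1 are nearly parallel, row 2 is reversed.
profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 8.5],
                     [4.0, 3.0, 2.0, 1.0]])
dissim = correlation_dissimilarity(profiles)
print(dissim)
```

Since 1 - r lies between 0 and 2, a high power pushes the dissimilarity of positively correlated profiles towards 0 and stretches that of anti-correlated profiles up to 2^8 = 256.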
Gene clustering (k=11), relevant genes in front
3D-View of 62 macroarrays (4824 genes)
Data processing challenges in biology
Data sets from
- metabolite measurements (2D-gels, HPLC),
- QTL LOD-score pattern compression,
- DNA-sequence arrangement.
Missing value imputation (probabilistic models)
Association studies (common latent space, CCA)
Rank-based data analysis (distribution models)
Faithful low-dimensional data representation
Proximity data handling
Common language: R / MATLAB / … ?
Thanks
http://pgrc-16.ipk-gatersleben.de/~stricker/
http://hitmds.webhop.net/
Pattern recognition group (IPK, headed by Udo Seiffert)
Nese Sreenivasulu (IPK, Molecular Biology)
Barbara Hammer (TU-Clausthal)
Thomas Villmann (University of Leipzig)
Some References
Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.
Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative Study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side publishers, Evere/Belgium, pp. 265-270, 2006.
Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Springer, 2006.
Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.
Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).