2nd Texas A&M Big Data Workshop – February 12, 2016
Pattern Recognition for High-Dimensional Data in Biomedicine
Ulisses M. Braga-Neto, Ph.D.
Associate Professor, Genomic Signal Processing Laboratory, Center for Bioinformatics and Genomic Systems Engineering, Department of Electrical and Computer Engineering, Texas A&M University
Ulisses Braga-Neto
Research Focus Areas: Statistical Pattern Recognition, Statistical Signal Processing, Modeling and Analysis of Molecular Expression Data.
Past and Current Biomedical Collaborators: Marcel Amstalden, TAMU Animal Sciences; Michael Bittner, Translational Genomics; Robb Chapkin, TAMU Human Nutrition; Mary Jane Cunningham, Nanomics Biosciences; Marty Dickman, TAMU Plant Pathology; Charles Johnson, TAMU CBGSE; Ernesto Marques, Jr., FIOCRUZ, Brazil and University of Pittsburgh; Louise Strong, MD Anderson Cancer Center (Post-Doc Mentor).
Mentoring: 5 Graduated Ph.D., 1 Former Post-Doc, 7 Current Ph.D., 1 Current M.Sc.
Peer-Reviewed Publications: 60 Journal Papers, 53 Conference Papers, 1 Book
Research Funding: NSF, TEES, AgriLife.
Big Data and The Total Library
In a 1939 essay, Jorge Luis Borges imagined “The Total Library,” with books containing every possible arrangement of letters.
“Everything would be in its blind volumes. Everything: the detailed history of the future, Aeschylus’ The Egyptians, the exact number of times that the waters of the Ganges have reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934, the proof of Pierre Fermat's theorem, the unwritten chapters of Edwin Drood, those same chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley invented concerning Time but didn't publish, Urizen's books of iron, the premature epiphanies of Stephen Dedalus, which would be meaningless before a cycle of a thousand years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog of the Library, the proof of the inaccuracy of that catalog. Everything.”
– Jorge Luis Borges, “The Total Library.”
Reality Check
“… but for every sensible line or accurate fact there would be millions of meaningless cacophonies, verbal farragoes, and babblings. Everything: but all the generations of mankind could pass, before the dizzying shelves – shelves that obliterate the day and on which chaos lies – ever reward them with a tolerable page.”
– Jorge Luis Borges, “The Total Library,” 1939.
"This problem can only be properly undertaken when an approximate knowledge of the orbit has been already attained, which is afterward to be corrected so as to satisfy all the observations in the most accurate manner possible.”
– Karl Friedrich Gauss, “Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium,” 1809.
“Pluralitas non est ponenda sine necessitate” [“Plurality is not to be posited without necessity”] (Ockham’s Razor) – William of Ockham, 14th Century.
Big Data vs. Small Data in Biomedicine
From the statistical point of view, the ratio of sample size to dimensionality is fundamental.
Fat Data Matrix = Big Data
Tall Data Matrix = Small Data
“The Curse of Dimensionality”
“Peaking Phenomenon” / “Hughes Phenomenon”
U.M. Braga-Neto, "Classification and Error Estimation for Discrete Data,” Current Genomics, Vol. 10, No. 7, Nov 2009, pp. 446-462.
G. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory, 1968, IT-14, 55-63.
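The peaking phenomenon can be illustrated with a small simulation. This is only a sketch under assumed conditions that do not come from the slides: two Gaussian classes with identity covariance, class-mean separation that decays as 1/j in feature j, a nearest-mean classifier, and 10 training points per class. The function names and parameter values are hypothetical. With the sample size held fixed, the expected true error typically falls for the first few features and then rises as more weakly informative features are added.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def true_error_nearest_mean(m0, m1, mu0, mu1):
    """Exact error of the nearest-mean rule built from estimated means m0, m1,
    when the classes are N(mu0, I) and N(mu1, I) with equal priors (assumed)."""
    w = m1 - m0                              # label 1 when w.x + b > 0
    b = -0.5 * float(w @ (m0 + m1))
    norm = float(np.linalg.norm(w))
    err0 = normal_cdf((float(w @ mu0) + b) / norm)    # class 0 labeled as 1
    err1 = normal_cdf(-(float(w @ mu1) + b) / norm)   # class 1 labeled as 0
    return 0.5 * (err0 + err1)

def expected_error_by_dimension(dims, n_per_class=10, n_trials=200, seed=0):
    """Average true error over random training sets, using the first d features."""
    rng = np.random.default_rng(seed)
    d_max = max(dims)
    delta = 1.0 / np.arange(1, d_max + 1)    # assumed decaying separation
    mu0, mu1 = -0.5 * delta, 0.5 * delta
    totals = np.zeros(len(dims))
    for _ in range(n_trials):
        x0 = rng.normal(size=(n_per_class, d_max)) + mu0
        x1 = rng.normal(size=(n_per_class, d_max)) + mu1
        m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
        for k, d in enumerate(dims):
            totals[k] += true_error_nearest_mean(m0[:d], m1[:d], mu0[:d], mu1[:d])
    return totals / n_trials

dims = [1, 2, 3, 5, 10, 20, 50]
errors = expected_error_by_dimension(dims)
for d, e in zip(dims, errors):
    print(f"d = {d:3d}  expected error = {e:.3f}")
```

Raising n_per_class in this sketch should move the minimum toward larger d, which is the sense in which the sample size/dimensionality ratio matters.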
Kim, S. et al. "Identification of Combination Gene Sets for Glioma Classification," Molecular Cancer Therapeutics, Vol. 1, No. 13, pp. 1229-1236, 2002.
Simple Classifiers Are Better
Figure panels: complex classification rule; simple classification rule.
“Scissors Effect”
Simple Classifiers Are Better
Simple Classifiers Are Better
Figure panels: no feature selection; with feature selection.
Feature selection produces a simpler, more constrained classification rule
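As an illustration of how feature selection constrains a classification rule, here is a minimal sketch of a filter that ranks features by a two-sample t-statistic and keeps the top k. The slides do not say which selection method was used, so the t-test filter, the function names, and the synthetic data (500 features, only the first two informative) are all assumptions made for the example.

```python
import numpy as np

def t_scores(x0, x1):
    """Absolute Welch t-statistic of each feature (column) between two classes."""
    n0, n1 = len(x0), len(x1)
    diff = x1.mean(axis=0) - x0.mean(axis=0)
    se = np.sqrt(x0.var(axis=0, ddof=1) / n0 + x1.var(axis=0, ddof=1) / n1)
    return np.abs(diff) / se

def select_top_features(x0, x1, k):
    """Indices of the k features with the largest |t|, in ascending index order."""
    order = np.argsort(t_scores(x0, x1))[::-1]
    return np.sort(order[:k])

# Synthetic data (assumed): 20 samples per class, 500 features,
# of which only features 0 and 1 differ between the classes.
rng = np.random.default_rng(0)
n_per_class, d = 20, 500
shift = np.zeros(d)
shift[:2] = 5.0
x0 = rng.normal(size=(n_per_class, d))
x1 = rng.normal(size=(n_per_class, d)) + shift

selected = select_top_features(x0, x1, k=2)
print("selected feature indices:", selected.tolist())

# A classifier trained on x0[:, selected] and x1[:, selected] works in
# 2 dimensions instead of 500, which is a much more constrained rule.
```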
Classifier Error Estimation
U.M. Braga-Neto and E. Dougherty, "Is Cross-Validation Valid for Small-Sample Microarray Classification?" Bioinformatics, Vol. 20, No. 3, Feb 2004, pp. 374-380.
Constrained, low-dimensional classifiers facilitate error estimation from limited amounts of sample data.
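A minimal sketch of two common error estimators, resubstitution and leave-one-out cross-validation, applied to a nearest-mean classifier on a small sample. The classifier, the Gaussian data model, the sample sizes, and all function names are assumptions made for illustration; a large independent test set stands in for the true error.

```python
import numpy as np

def fit_nearest_mean(x, y):
    """Return the two class means (labels are assumed to be 0 and 1)."""
    return x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)

def predict_nearest_mean(means, x):
    """Assign each row of x to the class with the closer mean."""
    m0, m1 = means
    d0 = np.sum((x - m0) ** 2, axis=1)
    d1 = np.sum((x - m1) ** 2, axis=1)
    return (d1 < d0).astype(int)

def resubstitution_error(x, y):
    """Error rate on the same data that was used for training."""
    return float(np.mean(predict_nearest_mean(fit_nearest_mean(x, y), x) != y))

def leave_one_out_error(x, y):
    """Train on n-1 points, test on the held-out point, average over all n."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        means = fit_nearest_mean(x[keep], y[keep])
        mistakes += int(predict_nearest_mean(means, x[i:i + 1])[0] != y[i])
    return mistakes / n

def sample(rng, n_per_class, d, shift):
    """Two unit-variance Gaussian classes whose means differ by `shift` per feature."""
    x = np.vstack([rng.normal(size=(n_per_class, d)),
                   rng.normal(size=(n_per_class, d)) + shift])
    y = np.repeat([0, 1], n_per_class)
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = sample(rng, n_per_class=10, d=2, shift=1.0)
x_test, y_test = sample(rng, n_per_class=5000, d=2, shift=1.0)

resub = resubstitution_error(x_train, y_train)
loo = leave_one_out_error(x_train, y_train)
means = fit_nearest_mean(x_train, y_train)
test_err = float(np.mean(predict_nearest_mean(means, x_test) != y_test))
print(f"resubstitution = {resub:.3f}, leave-one-out = {loo:.3f}, "
      f"large test set = {test_err:.3f}")
```

Rerunning with different seeds gives a feel for how much each estimate varies from one small training sample to the next.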
Dengue Fever Example
E.J.M. Nascimento, U.M. Braga-Neto et al., "Gene Expression Profiling During Acute Stage of Dengue Infection," PLoS ONE, Vol. 4, No. 11, Nov 2009, p. e7892
Breast Cancer Example
van de Vijver, et al. (2002) “A gene-expression signature as a predictor of survival in breast cancer.” New England Journal of Medicine, Vol. 347, pp. 1999–2009.
Originally published 70-gene signature:
independent test-set error = 68/180 ≈ 37.8%
More Accurate Classifier (2 genes)
Error ≈ 52/295 = 17.6% (Resubstitution Error)
U.M. Braga-Neto, Fads and Fallacies in the Name of Small-Sample Microarray Classification. IEEE Signal Processing Magazine, Special Issue on Signal Processing Methods in Genomics and Proteomics, Vol. 24, No. 1, January 2007, pp. 91-99.
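A quick arithmetic check of the two error rates quoted for the breast cancer example (68/180 for the 70-gene signature, 52/295 for the 2-gene classifier), rounded to one decimal place. The helper name is hypothetical.

```python
def error_rate(n_errors, n_samples):
    """Fraction of misclassified samples."""
    return n_errors / n_samples

signature_70_gene = error_rate(68, 180)   # independent test-set error
classifier_2_gene = error_rate(52, 295)   # resubstitution error

print(f"70-gene signature: {signature_70_gene:.1%}")   # prints 37.8%
print(f"2-gene classifier: {classifier_2_gene:.1%}")   # prints 17.6%
```

The two figures come from different estimators. Resubstitution is computed on the training data and is typically optimistic, so the gap between them should be read with that in mind.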
Braga-Neto and Dougherty, Error Estimation for Pattern Recognition, Wiley-IEEE Press, 2015.
This book is the first one dedicated exclusively to the topic of error estimation for pattern recognition. It covers both classical and recent results on the performance of error estimators for nonparametric and parametric classifiers.
Many of the issues related to Big Data discussed in this talk are covered in detail in the book.