
2nd Texas A&M Big Data Workshop – February 12, 2016

Pattern Recognition for High-Dimensional Data in Biomedicine

Ulisses M. Braga-Neto, Ph.D.

Associate Professor
Genomic Signal Processing Laboratory
Center for Bioinformatics and Genomic Systems Engineering
Department of Electrical and Computer Engineering
Texas A&M University


Ulisses Braga-Neto

Research Focus Areas: Statistical Pattern Recognition, Statistical Signal Processing, Modeling and Analysis of Molecular Expression Data.

Past and Current Biomedical Collaborators:
•  Marcel Amstalden, TAMU Animal Sciences
•  Michael Bittner, Translational Genomics
•  Robb Chapkin, TAMU Human Nutrition
•  Mary Jane Cunningham, Nanomics Biosciences
•  Marty Dickman, TAMU Plant Pathology
•  Charles Johnson, TAMU CBGSE
•  Ernesto Marques, Jr., FIOCRUZ, Brazil and University of Pittsburgh
•  Louise Strong, MD Anderson Cancer Center (Post-Doc Mentor)

Mentoring: 5 Graduated Ph.D., 1 Former Post-Doc, 7 Current Ph.D., 1 Current M.Sc.

Peer-Reviewed Publications: 60 Journal Papers, 53 Conference Papers, 1 Book

Research Funding: NSF, TEES, AgriLife.


Big Data and The Total Library

•  In a 1939 essay, Jorge Luis Borges imagined “The Total Library,” with books containing every possible arrangement of letters.

“Everything would be in its blind volumes. Everything: the detailed history of the future, Aeschylus’ The Egyptians, the exact number of times that the waters of the Ganges have reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934, the proof of Pierre Fermat's theorem, the unwritten chapters of Edwin Drood, those same chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley invented concerning Time but didn't publish, Urizen's books of iron, the premature epiphanies of Stephen Dedalus, which would be meaningless before a cycle of a thousand years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog of the Library, the proof of the inaccuracy of that catalog. Everything.”

– Jorge Luis Borges, “The Total Library.”


Reality Check

“… but for every sensible line or accurate fact there would be millions of meaningless cacophonies, verbal farragoes, and babblings. Everything: but all the generations of mankind could pass, before the dizzying shelves – shelves that obliterate the day and on which chaos lies – ever reward them with a tolerable page.”

– Jorge Luis Borges, “The Total Library,” 1939.

“This problem can only be properly undertaken when an approximate knowledge of the orbit has been already attained, which is afterward to be corrected so as to satisfy all the observations in the most accurate manner possible.”

– Karl Friedrich Gauss, “Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium,” 1809.

“Pluralitas non est ponenda sine necessitate” (Ockham’s Razor) – William of Ockham, 14th Century.


Big Data vs. Small Data in Biomedicine

From the statistical point of view, the ratio sample size/dimensionality is fundamental.

Fat Data Matrix = Big Data

Tall Data Matrix = Small Data


“The Curse of Dimensionality”

“Peaking Phenomenon” / “Hughes Phenomenon”

U.M. Braga-Neto, "Classification and Error Estimation for Discrete Data," Current Genomics, Vol. 10, No. 7, Nov 2009, pp. 446-462.

G. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory, 1968, IT-14, 55-63.
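The peaking phenomenon can be illustrated with a small simulation. The sketch below is not taken from the cited papers. It assumes two Gaussian classes with identity covariance in which only the first two features differ in mean, and a nearest-centroid classifier trained on five points per class. All names and parameter values (peaking_curve, n_informative, delta, and so on) are illustrative choices.

```python
import random

def sample(mean, n, rng):
    """Draw n points from a Gaussian with identity covariance and the given mean."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n)]

def centroid(points, d):
    """Mean of the first d coordinates of the points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(d)]

def nearest_centroid_error(train0, train1, test0, test1, d):
    """Test error of a nearest-centroid rule that uses only the first d features."""
    c0, c1 = centroid(train0, d), centroid(train1, d)
    w = [b - a for a, b in zip(c0, c1)]
    mid = [(a + b) / 2.0 for a, b in zip(c0, c1)]

    def predict(x):
        # Positive score means x is closer to the class-1 centroid.
        score = sum(wj * (x[j] - mid[j]) for j, wj in enumerate(w))
        return 1 if score > 0 else 0

    wrong = sum(predict(x) != 0 for x in test0) + sum(predict(x) != 1 for x in test1)
    return wrong / (len(test0) + len(test1))

def peaking_curve(dims, n_train=5, n_test=200, n_informative=2,
                  delta=1.5, reps=40, seed=0):
    """Average test error for each number of features in dims (assumed model)."""
    rng = random.Random(seed)
    d_max = max(dims)
    mean0 = [0.0] * d_max
    # Only the first n_informative features separate the classes; the rest are noise.
    mean1 = [delta if j < n_informative else 0.0 for j in range(d_max)]
    totals = {d: 0.0 for d in dims}
    for _ in range(reps):
        train0 = sample(mean0, n_train, rng)
        train1 = sample(mean1, n_train, rng)
        test0 = sample(mean0, n_test, rng)
        test1 = sample(mean1, n_test, rng)
        for d in dims:
            totals[d] += nearest_centroid_error(train0, train1, test0, test1, d)
    return {d: totals[d] / reps for d in dims}

curve = peaking_curve([1, 2, 5, 10, 25, 50])
for d, err in curve.items():
    print(f"d = {d:2d}  average test error = {err:.3f}")
```

Under these assumptions the average error is lower with the two informative features than with one, and it rises again as noise features are added, because each extra estimated mean contributes estimation noise without adding signal.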



Kim, S. et al. "Identification of Combination Gene Sets for Glioma Classification," Molecular Cancer Therapeutics, Vol. 1, No. 13, pp. 1229-1236, 2002.

Simple Classifiers Are Better


[Figure panels: complex classification rule; simple classification rule]

“Scissors Effect”




[Figure panels: no feature selection; with feature selection]

Feature selection produces a simpler, more constrained classification rule
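One simple way to obtain such a constrained rule is filter feature selection. The sketch below is an illustration, not the method used in the talk: it ranks features by the absolute Welch t-statistic between the two classes and keeps the top k. The synthetic data, the function names (t_statistic, select_features) and the parameter values are all assumptions.

```python
import random
from statistics import mean, variance

def t_statistic(x0, x1):
    """Welch two-sample t-statistic for a single feature."""
    se = (variance(x0) / len(x0) + variance(x1) / len(x1)) ** 0.5
    return (mean(x1) - mean(x0)) / se

def select_features(class0, class1, k):
    """Indices of the k features with the largest absolute t-statistic."""
    d = len(class0[0])
    scores = []
    for j in range(d):
        col0 = [p[j] for p in class0]
        col1 = [p[j] for p in class1]
        scores.append(abs(t_statistic(col0, col1)))
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Synthetic data (assumption): 20 features, 15 samples per class,
# and only features 0 and 1 differ in mean between the classes.
rng = random.Random(1)
d, n, delta = 20, 15, 3.0
shift = [delta if j < 2 else 0.0 for j in range(d)]
class0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
class1 = [[rng.gauss(s, 1.0) for s in shift] for _ in range(n)]

selected = select_features(class0, class1, k=2)
print("selected features:", selected)
```

A classifier trained only on the selected columns has far fewer parameters to estimate than one trained on all 20 features.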


Classifier Error Estimation

U.M. Braga-Neto and E. Dougherty, "Is Cross-Validation Valid for Small-Sample Microarray Classification?" Bioinformatics, Vol. 20, No. 3, Feb 2004, pp. 374-380.

Constrained, low-dimensional classifiers facilitate error estimation from limited amounts of sample data.
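The following minimal sketch compares two error estimators computed from the training sample alone, resubstitution and leave-one-out cross-validation, with the error on a large independent sample. It assumes a nearest-centroid rule and synthetic two-dimensional Gaussian data; the function names and all parameter values are illustrative and do not come from the cited paper.

```python
import random

def centroid(points):
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(class0, class1):
    """Nearest-centroid rule: returns a function mapping a point to label 0 or 1."""
    c0, c1 = centroid(class0), centroid(class1)
    return lambda x: 1 if sqdist(x, c1) < sqdist(x, c0) else 0

def error_rate(classify, class0, class1):
    wrong = sum(classify(x) != 0 for x in class0) + sum(classify(x) != 1 for x in class1)
    return wrong / (len(class0) + len(class1))

def resubstitution(class0, class1):
    """Error of the trained rule on the very points used to train it."""
    return error_rate(train(class0, class1), class0, class1)

def leave_one_out(class0, class1):
    """Each point is classified by a rule trained on all the other points."""
    wrong = 0
    for i, x in enumerate(class0):
        wrong += int(train(class0[:i] + class0[i + 1:], class1)(x) != 0)
    for i, x in enumerate(class1):
        wrong += int(train(class0, class1[:i] + class1[i + 1:])(x) != 1)
    return wrong / (len(class0) + len(class1))

# Synthetic data (assumption): 10 training points per class in 2 dimensions.
rng = random.Random(2)

def draw(mean, n):
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n)]

class0, class1 = draw([0.0, 0.0], 10), draw([1.0, 1.0], 10)
test0, test1 = draw([0.0, 0.0], 2000), draw([1.0, 1.0], 2000)

resub = resubstitution(class0, class1)
loo = leave_one_out(class0, class1)
holdout = error_rate(train(class0, class1), test0, test1)
print(f"resubstitution = {resub:.3f}, leave-one-out = {loo:.3f}, large test set = {holdout:.3f}")
```

For this particular rule, removing a point moves its own class centroid away from it, so leave-one-out never reports fewer errors than resubstitution. With only 20 training points both estimates can still differ noticeably from the large-sample error.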



Dengue Fever Example

E.J.M. Nascimento, U.M. Braga-Neto et al., "Gene Expression Profiling During Acute Stage of Dengue Infection," PLoS ONE, Vol. 4, No. 11, Nov 2009, p. e7892


Breast Cancer Example

van de Vijver, et al. (2002) “A gene-expression signature as a predictor of survival in breast cancer.” New England Journal of Medicine, Vol. 347, 1999–2009.

Originally published 70-gene signature:

independent test-set error = 68/180 ≈ 37.8%


More Accurate Classifier (2 genes)

Error ≈ 52/295 = 17.6% (Resubstitution Error)

U.M. Braga-Neto, Fads and Fallacies in the Name of Small-Sample Microarray Classification. IEEE Signal Processing Magazine, Special Issue on Signal Processing Methods in Genomics and Proteomics, Vol. 24, No. 1, January 2007, pp. 91-99.
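The two error rates quoted in the breast cancer example can be reproduced from the raw counts. The sketch below also attaches a normal-approximation interval to the independent test-set error. That interval is an added illustration, not a figure from the talk, and it is not applied to the 2-gene classifier because a resubstitution error reuses the training data and tends to be optimistic. The helper names are hypothetical.

```python
import math

def error_rate(errors, total):
    """Fraction of misclassified samples."""
    return errors / total

def normal_interval(errors, total, z=1.96):
    """Normal-approximation interval for an error rate.

    Meaningful only when the counted samples were not used to train the classifier.
    """
    p = errors / total
    half = z * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

signature_70 = error_rate(68, 180)   # 70-gene signature, independent test set
two_gene = error_rate(52, 295)       # 2-gene classifier, resubstitution
low, high = normal_interval(68, 180)

print(f"70-gene signature: {100 * signature_70:.1f}% "
      f"(approx. 95% interval {100 * low:.1f}% to {100 * high:.1f}%)")
print(f"2-gene classifier: {100 * two_gene:.1f}% (resubstitution, no interval)")
```

Because the two numbers come from different kinds of estimators, the gap between them should be read as suggestive rather than as a like-for-like comparison.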


Braga-Neto and Dougherty, Error Estimation for Pattern Recognition, Wiley-IEEE Press, 2015.

This book is the first one dedicated exclusively to the topic of error estimation for pattern recognition. It covers both classical and recent results on the performance of error estimators for nonparametric and parametric classifiers.

Many of the issues related to Big Data discussed in this talk are covered in detail in the book.
