Machine learning based analysis on Alzheimer’s & Parkinson’s data Margherita Squillario, Annalisa Barla http://slipguru.disi.unige.it / Thursday, May 26, 2011


people@slipguru (& compbio group)

Faculty: Ernesto De Vito, Francesca Odone, Alessandro Verri

Post Docs: Annalisa Barla, Curzio Basso, Sofia Mosci, Nicoletta Noceti, Lorenzo Rosasco, Matteo Santoro, Margherita Squillario, Veronica Umanità, Silvia Villa

PhD students: Alessandro Rudi, Laura Bazzotti, Gabriele Chiusano, Giovanni Fusco, Salvatore Masecchia, Saverio Salzo, Alessandra Staglianò, Luca Zini, Grzegorz Zycinski

Research Grants: Sonia Menocci, Valentina Russo


research topics@slipguru

• study and development of the theoretical foundations of learning and learning methods from the mathematical and statistical viewpoint

• design and development of mathematically sound methods for extracting visual information from images (also medical)

• design of methods and algorithms for extracting biological information underlying a given biological process


how we got here..

• we already had extensive experience in machine learning and its applications (image understanding)

• 2005: chance to face new problems and solve them with regularization techniques

• Health-e-Child EU IST

• IGG (Molecular Biology Lab)

• IST, DIMI (Unige)

• but the data at hand were somewhat different from what we were used to: n << d, i.e. far fewer samples than variables (curse of dimensionality)


Feature Selection Step


molecular biology

• a typical scenario is n<<d

• number of samples cannot always be increased (rare diseases and expensive technology)

• (mostly) high-throughput data

✤ new technologies (DNA microarrays, CGH, SNP, etc.)

✤ possibility to measure the whole genome

✤ most of the time the data are noisy (getting better any day now..)

[Figure: biological samples → microarray gene expression → computational methods → selected genes]

Relevant Gene List (example of selected genes):

probe set      gene symbol
230746_s_at    STC1
230710_at      ---
230630_at      AK3L2
228499_at      PFKFB4
228483_s_at    TAF9B
227337_at      ANKRD37
227068_at      PGK1
226632_at      CYGB
226452_at      PDK1
226348_at      ---
226347_at      ---


learning from examples paradigm

the GOAL is not to memorize but to GENERALIZE, i.e. to predict

given a set of examples: {(x1, y1), (x2, y2), ..., (xn, yn)}

find a function: f(x) ~ y

such that f is a good predictor on new data as well as on the given dataset

and possibly identify the most discriminating variables (genes) --> gene signature

[Diagram: input x → f → output y]


why going multivariate?

searching for DIFFERENTIALLY EXPRESSED GENES is not always sufficient: univariate approaches may not be flexible enough...

[Figure: scatter plot of the samples, with axes gene 1 and gene 2]
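The point of this slide can be illustrated with a small synthetic sketch. The data below are made up (not taken from the AD/PD datasets), and the label encoding and effect sizes are assumptions chosen only to make the effect visible: each gene alone is a poor classifier, while the two genes considered jointly separate the classes well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 200

# Hypothetical encoding: -1 for controls, +1 for cases.
labels = np.repeat([-1.0, 1.0], n_per_class)

# A shared source of variation dominates both genes ...
shared = rng.normal(0.0, 3.0, size=labels.size)
# ... while the class shifts the two genes in opposite directions.
gene1 = shared + 0.3 * labels + rng.normal(0.0, 0.1, size=labels.size)
gene2 = shared - 0.3 * labels + rng.normal(0.0, 0.1, size=labels.size)


def accuracy(score, labels):
    """Fraction of samples whose score has the same sign as the label."""
    return float(np.mean(np.sign(score) == labels))


# Univariate rules: threshold a single gene at its overall mean.
acc_gene1 = accuracy(gene1 - gene1.mean(), labels)
acc_gene2 = accuracy(-(gene2 - gene2.mean()), labels)
# Multivariate rule: combine both genes (here, their difference).
acc_both = accuracy(gene1 - gene2, labels)

print(f"gene 1 alone: {acc_gene1:.2f}")
print(f"gene 2 alone: {acc_gene2:.2f}")
print(f"both genes:   {acc_both:.2f}")
```

Neither gene would stand out in a gene-by-gene test for differential expression here, yet the pair carries almost all of the class information.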


l1l2 variable selection method

Empirical Risk minimization combined with a mixed penalty:

• l1 term enforcing sparsity

• l2 term preserving correlation

Zou, H., Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, 2005.

De Mol, C., De Vito, E., Rosasco, L. Elastic-net regularization in learning theory. Journal of Complexity, 2009.

[Figure: l1 norm vs. l2 norm penalties over the variables x1 and x2; models shown: f(x)=β1⋅x1 (sparse) and f(x)=β1⋅x1+β2⋅x2]
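The functional itself is not legible in this transcript. As a hedged reconstruction, an elastic-net functional of the kind discussed in the cited references can be written as below. The 1/n normalization and the symbols X, Y and β are assumptions; τ and μ follow the parameter names used on the later slides (τ weighting the l1 term, μ the l2 term).

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^d}\;
  \frac{1}{n}\,\lVert Y - X\beta \rVert_2^2
  \;+\; \mu\,\lVert \beta \rVert_2^2
  \;+\; \tau\,\lVert \beta \rVert_1
```

The first term is the empirical risk for the square loss, the l2 term keeps correlated variables together, and the l1 term drives many coefficients to exactly zero.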


Consistency guaranteed (the more samples available the better the estimator)

Not univariate: takes into account behavior of many genes at once.

Parameters: regularization parameter (weight of the l1 term) and correlation parameter (weight of the l2 term)
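A minimal numerical sketch of an l1l2 (elastic-net) minimization follows. It uses plain iterative soft-thresholding (proximal gradient) with NumPy and is not the l1l2py implementation; the functional (1/n)‖Xβ − y‖² + μ‖β‖² + τ‖β‖₁, the function names and the toy data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (component-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def objective(X, y, beta, tau, mu):
    """Assumed functional: (1/n)||X beta - y||^2 + mu ||beta||^2 + tau ||beta||_1."""
    residual = X @ beta - y
    n = X.shape[0]
    return residual @ residual / n + mu * (beta @ beta) + tau * np.abs(beta).sum()


def l1l2_ista(X, y, tau, mu, n_iter=500):
    """Iterative soft-thresholding with constant step 1/L, started at zero."""
    n, d = X.shape
    # Lipschitz constant of the gradient of the smooth part.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * mu
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * (X.T @ (X @ beta - y)) / n + 2.0 * mu * beta
        beta = soft_threshold(beta - grad / lipschitz, tau / lipschitz)
    return beta


# Toy n << d problem: 30 samples, 200 variables, 2 of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=30)

# At or above tau_max the all-zero vector is the minimizer.
tau_max = np.max(np.abs(2.0 * (X.T @ y) / X.shape[0]))
beta_sparse = l1l2_ista(X, y, tau=0.5 * tau_max, mu=0.1)
beta_empty = l1l2_ista(X, y, tau=1.01 * tau_max, mu=0.1)

print("number of selected variables:", np.count_nonzero(beta_sparse))
print("selected indices:", np.flatnonzero(beta_sparse))
```

The soft-thresholding step is what produces exact zeros, so the selected variables are simply the nonzero entries of the returned vector.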


l1l2 variable selection method

output: One-parameter family of nested lists with equivalent prediction ability and increasing correlation among genes.

μ→0 : minimal list of prototype genes

μ1<μ2<μ3<... : longer lists including correlated genes

we can tune the correlation parameter to vary the list length


two stage approach (De Mol, Mosci, Traskine, Verri 2008)

variable selection step (l1l2):

classification step (rls):

for each μ (correlation parameter) we have to choose λ and τ
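The two stages can be sketched as follows, under stated assumptions: the coefficient vector from the l1l2 step is taken as given (any solver could produce it), and the classification step is written as regularized least squares with the functional (1/n)‖y − Xβ‖² + λ‖β‖², restricted to the selected variables. The function names and the toy data are hypothetical.

```python
import numpy as np


def rls_fit(X, y, lam):
    """Regularized least squares: solve (X^T X / n + lam I) beta = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)


def two_stage(X, y, beta_l1l2, lam):
    """Keep the variables with nonzero l1l2 coefficient, then refit them with RLS."""
    selected = np.flatnonzero(beta_l1l2)
    beta = np.zeros(X.shape[1])
    if selected.size:
        beta[selected] = rls_fit(X[:, selected], y, lam)
    return selected, beta


# Toy example: the first stage is assumed to have kept variables 1 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X[:, 1] - X[:, 3]
beta_stage1 = np.array([0.0, 0.5, 0.0, -0.2])  # shrunken l1l2 coefficients (made up)

selected, beta = two_stage(X, y, beta_stage1, lam=1e-8)
print("selected:", selected)
print("refitted coefficients:", np.round(beta, 3))
```

Refitting on the selected variables only reduces the shrinkage that the l1 penalty imposes on the coefficients in the first stage.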


Gene selection framework

Christine De Mol, Sofia Mosci, Magali Traskine, Alessandro Verri. A Regularized Method for Selecting Nested Groups of Relevant Genes from Microarray Data. Journal of Computational Biology, 2009.


statistical learning framework

the optimal pair (λ*, τ*) is one of the A⋅B possible pairs (λ, τ)ij

λ → (λ1, ..., λA)

τ → (τ1, ..., τB)

computational time in the LOO case (for one task):

time1-optim = 2.5 s ÷ 25 s, depending on the correlation parameter

Total Time = A⋅B⋅N.samples⋅time1-optim ~ 20⋅20⋅30⋅time1-optim ~ 3⋅10^4 s ÷ 3⋅10^5 s

A Barla, S Mosci, L Rosasco, A Verri. A method for robust variable selection with significance assessment. Proc. of ESANN, 2008.
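The cost estimate is the product of the grid size, the number of leave-one-out splits and the time of a single optimization. A small sketch of that arithmetic, using the figures quoted on the slide (the function and argument names are hypothetical):

```python
def loo_grid_cost(n_lambda, n_tau, n_samples, seconds_per_fit):
    """Number of optimizations and total time for a leave-one-out grid search."""
    n_fits = n_lambda * n_tau * n_samples  # one fit per (lambda, tau) pair and left-out sample
    return n_fits, n_fits * seconds_per_fit


n_fits, fastest = loo_grid_cost(20, 20, 30, 2.5)
_, slowest = loo_grid_cost(20, 20, 30, 25.0)

print(f"{n_fits} optimizations")
print(f"between {fastest:.0f} s and {slowest:.0f} s "
      f"(about {fastest / 3600:.1f} to {slowest / 3600:.1f} hours)")
```

Since every fit is independent of the others, this is the kind of workload that can be distributed over several processes.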


l1l2 variable selection method: l1l2py and pplus

• l1l2py: a Python implementation of the l1l2 variable selection method, available at http://slipguru.disi.unige.it/Research/L1L2Py

• pplus: a Python wrapper for the Python library pp (should soon be available on our website)


Pathway Enrichment


functional characterization of the signature

• WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". The tool is available at: http://bioinfo.vanderbilt.edu/webgestalt/

• The analysis consists of performing a GSEA (gene set enrichment analysis) on Gene Ontology and/or KEGG, given the gene signature obtained by l1l2.

1. Zhang, B., Kirov, S.A., Snoddy, J.R. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res, 33(Web Server issue), W741-748. 2005.

2. Duncan, D.T., Prodduturi, N., Zhang, B. WebGestalt2: an updated and expanded version of the Web-based Gene Set Analysis Toolkit. BMC Bioinformatics, 11(Suppl 4):P10. 2010.
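To make the enrichment idea concrete, here is a minimal sketch of an over-representation test based on the hypergeometric distribution, a common choice for this kind of analysis. Whether it matches the exact statistic and settings used in WebGestalt for these results is an assumption, and the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_category, n_signature, n_overlap):
    """P(X >= n_overlap) when n_signature genes are drawn without replacement
    from n_background genes, n_in_category of which belong to the category."""
    upper = min(n_in_category, n_signature)
    total = comb(n_background, n_signature)
    tail = sum(
        comb(n_in_category, k) * comb(n_background - n_in_category, n_signature - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20000 background genes, a category with 100 genes,
# a signature of 73 genes, 5 of which fall in the category.
p = enrichment_p_value(20000, 100, 73, 5)
print(f"enrichment p-value: {p:.2e}")
```

In practice one p-value is computed per Gene Ontology node or KEGG pathway, and the resulting list is corrected for multiple testing before nodes are reported as enriched.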


AD and PD data: a case study


AD and PD data: a case study

• AD: GSE9770 (early stage) and GSE5281 (late stage); platform: Affymetrix HG-U133 Plus 2.0

• PD: GSE6613 (early stage) and GSE20295 (late stage); platform: Affymetrix HG-U133A

AD dataset    controls    cases
GSE9770       62          29
GSE5281       62          68

PD dataset    controls    cases
GSE6613       22          50
GSE20295      53          40

poster @ ADPD’11: Uncovering candidate biomarkers for Alzheimer’s and Parkinson’s diseases with regularization methods and prior knowledge


AD and PD data: a case study

GSE9770: Winnie S. Liang et al. 2010. Neuronal gene expression in non-demented individuals with intermediate Alzheimer’s Disease neuropathology.

GSE5281: Winnie S. Liang et al. 2008. Alzheimer’s disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons.

GSE6613: Scherzer C.R. et al. 2006. Molecular markers of early Parkinson’s disease based on gene expression in blood.

GSE20295: Zhang et al. 2005. Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms.

Feature Selection Step (l1l2)


Results: l1l2 analysis

4 gene signatures specific for the early and late stage of both AD and PD

dataset     stage       #genes    accuracy (%)
GSE9770     early AD    73        62
GSE5281     late AD     90        80
GSE6613     early PD    132       90
GSE20295    late PD     106       95


Signature Comparison

[Figure: Venn diagrams comparing the Early PD, Late PD, Early AD and Late AD signatures. Gene labels shown in the diagrams: XIST, RPS4Y1, DEFA1/DEFA3, HLA-DQB1, HBB, PMS2L1/PMS2L2, SCAMP1, TAC1, SST, CD44, SYNCRIP, ACTR2, WNK1, UBE3A, MALAT1, RGS1, RGS4, OXR1, HBA1/HBA2, MOBP, MBP]


Pathway Enrichment (WebGestalt)


Functional Characterization (GSE9770)

early AD


Functional Characterization (GSE5281)

late AD


Alzheimer’s disease analysis

• The overlapping nodes are few, but some of them are more specific: they concern the circulatory system (in particular blood circulation) and the neurological system process, which is linked to the transmission of nerve impulse node.

• The nodes specific to the early stage concern:

  • the nervous system,

  • the blood.

• The nodes specific to the late stage are mostly related to:

  • negative regulation of several processes (e.g. of cell proliferation, of cell communication, of macromolecule biosynthetic process),

  • the neurological system process (connected to the last enriched node, named visual perception),

  • behavior,

  • the response to stimuli (behavior, drug, hormone).


Functional Characterization (GSE6613)

early PD


Functional Characterization (GSE20295)

late PD


Parkinson’s disease analysis

• The overlapping nodes are few and have a very general meaning (e.g. intracellular, cytoplasm, negative regulation of biological process).

• The nodes specific to the early stage concern:

  • the immune system,

  • the response to stimulus (i.e. stress, chemicals or other organisms such as viruses),

  • the regulation of metabolic processes, the biological quality and cell death.

• The nodes specific to the late stage are related to:

  • the nervous system (e.g. neurotransmitter transport, transmission of nerve impulse, learning or memory),

  • the response to stimuli (e.g. behavior, temperature, organic substances, drugs or endogenous stimuli).


Prior Knowledge Integration


Function-based analysis via l1-l2 regularization

• combine the selection protocol with the GO structure

• provide a way to easily interpret the output of feature selection protocol

joint work with University of Padova; poster @ ECCB’09: Function-based analysis of microarray data via l1-l2 regularization

[Figure: Gene Ontology and expression data feed the classification via l1-l2 regularization; SVS python library]


Function-based analysis via l1-l2 regularization

[Figure: PD data, with panels labeled common, PD early and PD late]

[Figure: AD data, with panels labeled common, AD early and AD late]


Gene Network Analysis


Gene Network Analysis (preliminary results)

• Since most known diseases are systemic in nature, we tackled this systems biology problem from the perspective of network medicine.

• The aim is to understand the molecular pathways (from transcription to signaling inside the cells).

• To do so, we estimate these pathways using algorithms for inferring network topology from high-throughput measurements.

joint work with FBK, Trento: A machine learning pipeline for discriminant analysis and pathways identification, submitted @CIBB2011 and @bioinformatics


Gene Network Analysis (preliminary results)

• machine learning based pipeline for evaluating the disruption of important molecular pathways

• input: microarray measurements (case vs control)

• Pipeline:

Gene Expression Data + Phenotype
→ Classification/Regression with feature selection step (SRDA, L1L2)
→ Pathway Enrichment (GSEA, GSA)
→ Subnetwork Inference (Aracne, WGCNA)
→ Subnetwork Analysis (Density)
→ Subnetwork Comparison


• For each analyzed dataset we identify/build:

• a gene signature (l1l2)

• the most enriched pathways in the signature (WebGestalt/GSEA)

• the network for the selected pathways (considering all the genes involved in each pathway), which we then compare between cases and controls.
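The last step can be sketched as follows: infer one weighted co-expression network per group for the genes of a pathway, then compare a summary such as the network density. The adjacency used here, |correlation| raised to a power, is in the spirit of WGCNA's soft thresholding; the exponent 6, the function names and the synthetic data are assumptions, and this is not the Aracne or WGCNA code itself.

```python
import numpy as np


def coexpression_adjacency(expr, power=6):
    """Weighted adjacency |corr|^power between genes (columns of expr)."""
    corr = np.corrcoef(expr, rowvar=False)
    adjacency = np.abs(corr) ** power
    np.fill_diagonal(adjacency, 0.0)  # no self-loops
    return adjacency


def density(adjacency):
    """Mean edge weight over all distinct gene pairs."""
    p = adjacency.shape[0]
    return float(adjacency[np.triu_indices(p, k=1)].mean())


# Synthetic pathway of 10 genes measured on 40 controls and 40 cases.
rng = np.random.default_rng(0)
shared = rng.normal(size=(40, 1))
controls = shared + 0.3 * rng.normal(size=(40, 10))  # genes co-regulated
cases = rng.normal(size=(40, 10))                    # co-regulation lost

density_controls = density(coexpression_adjacency(controls))
density_cases = density(coexpression_adjacency(cases))
print(f"density in controls: {density_controls:.3f}")
print(f"density in cases:    {density_cases:.3f}")
```

A large difference in density between the two groups is one simple way to flag a pathway as disrupted and to rank pathways against each other.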

Gene Network Analysis


Gene Network Analysis (AD early)

• The analysis allows for a ranking of the most disrupted pathways.

• Here we present the GO:0019787 node (small conjugating protein ligase activity)

[Figure: inferred networks for cases and controls]


Gene Network Analysis (PD)

• The analysis allows for a ranking of the most disrupted pathways.

• Here we present the GO:0045087 node (innate immune response)

[Figure: inferred networks for cases and controls]


ongoing projects

project               collaboration    publications
AD                    --               BMC Med Gen*
PD                    --               poster ECCB 2010
AD/PD comparison      --               poster ADPD2011
biomarker stability   DEI UniPD/FBK    Bioinformatics*
Breast+prior GO       DEI UniPD        --
network analysis      FBK              CIBB2011*/Bioinformatics*
CGH neuroblastoma     IST              --
CGH/SKY cell lines    DIMI/NIH         --
JIA subtyping         IGG+SIGMA        poster ISMB/ECCB 2011
