Machine learning based analysis on Alzheimer’s & Parkinson’s data
Margherita Squillario, Annalisa Barla
http://slipguru.disi.unige.it/
Thursday, May 26, 2011
people@slipguru (& compbio group)
Faculty: Ernesto De Vito, Francesca Odone, Alessandro Verri
Post Docs: Annalisa Barla, Curzio Basso, Sofia Mosci, Nicoletta Noceti, Lorenzo Rosasco, Matteo Santoro, Margherita Squillario, Veronica Umanità, Silvia Villa
PhD students: Alessandro Rudi, Laura Bazzotti, Gabriele Chiusano, Giovanni Fusco, Salvatore Masecchia, Saverio Salzo, Alessandra Staglianò, Luca Zini, Grzegorz Zycinski
Research Grants: Sonia Menocci, Valentina Russo
research topics@slipguru
• study and development of the theoretical foundations of learning and learning methods from the mathematical and statistical viewpoint
• design and development of mathematically sound methods for extracting visual information from images (also medical)
• design of methods and algorithms for extracting biological information underlying a given biological process
how we got here..
• we already had extensive experience in machine learning and its applications (image understanding)
• 2005: chance to face new problems and solve them with regularization techniques
• Health-e-Child EU IST
• IGG (Molecular Biology Lab)
• IST, DIMI (Unige)
• but the data at hand were somewhat different from what we were used to: n << d, i.e. far fewer samples (n) than variables (d) (curse of dimensionality)
Feature Selection Step
molecular biology
• a typical scenario is n<<d
• number of samples cannot always be increased (rare diseases and expensive technology)
• (mostly) high-throughput data
  ✤ new technologies (DNA microarrays, CGH, SNP, etc.)
  ✤ possibility to measure the whole genome
  ✤ most of the time the data are noisy (though quality keeps improving)
[Figure: biological samples → microarray gene expression → computational methods → selected genes]
Relevant Gene List
Probe ID       Gene symbol
230746_s_at    STC1
230710_at      ---
230630_at      AK3L2
228499_at      PFKFB4
228483_s_at    TAF9B
227337_at      ANKRD37
227068_at      PGK1
226632_at      CYGB
226452_at      PDK1
226348_at      ---
226347_at      ---
learning from examples paradigm
the GOAL is not to memorize but to GENERALIZE, e.g. to predict
given a set of examples: {(x1,y1), (x2,y2), ..., (xn,yn)}
find a function f (input x → output y) with f(x) ~ y
such that f is a good predictor on new data as well as on the given dataset
and possibly identify the most discriminating variables (genes) → gene signature
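A minimal sketch of this paradigm, under assumptions that are not in the slides: the data are synthetic (y = 2x + 1 plus noise), f is restricted to a straight line fitted by least squares, and all helper names are illustrative. It fits f on the given examples and then measures the error on fresh data, which is what "generalize" refers to.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of f(x) = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(model, xs, ys):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_examples(n, rng):
    # Hypothetical data-generating process: y = 2x + 1 plus Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_examples(30, rng)    # the given examples
x_new, y_new = make_examples(1000, rng)      # data never seen during fitting

f = fit_line(x_train, y_train)
train_error = mean_squared_error(f, x_train, y_train)
new_error = mean_squared_error(f, x_new, y_new)
print(f"error on the given examples: {train_error:.4f}")
print(f"error on new data:           {new_error:.4f}")
```

A predictor that generalizes keeps the second number close to the first; a predictor that only memorizes the examples does not.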
why go multivariate?
searching for DIFFERENTIALLY EXPRESSED GENES is not always sufficient: univariate approaches may not be flexible enough...
[Figure: samples plotted by gene 1 vs. gene 2]
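The point can be illustrated with a small synthetic example. Everything below is hypothetical (the two "genes", the noise levels, the decision rules): each gene alone separates the two classes poorly because a large shared variation swamps the class effect, while the difference of the two genes separates them almost perfectly.

```python
import random

rng = random.Random(0)
labels, gene1, gene2 = [], [], []
for i in range(200):
    y = 1 if i % 2 == 0 else -1
    shared = rng.gauss(0, 3)          # large variation common to both genes
    labels.append(y)
    gene1.append(shared + 0.5 * y + rng.gauss(0, 0.1))
    gene2.append(shared - 0.5 * y + rng.gauss(0, 0.1))


def accuracy(scores, labels):
    """Fraction of samples whose score sign matches the class label."""
    hits = sum(1 for s, y in zip(scores, labels) if (s > 0) == (y > 0))
    return hits / len(labels)


# Univariate rules: threshold one gene at a time.
acc_gene1 = accuracy(gene1, labels)
acc_gene2 = accuracy([-g for g in gene2], labels)
# Multivariate rule: use both genes at once (their difference).
acc_joint = accuracy([a - b for a, b in zip(gene1, gene2)], labels)

print(f"gene 1 alone: {acc_gene1:.2f}")
print(f"gene 2 alone: {acc_gene2:.2f}")
print(f"both genes:   {acc_joint:.2f}")
```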
l1l2 variable selection method
Empirical Risk minimization combined with a mixed penalty:
• l1 term enforcing sparsity
• l2 term preserving correlation
Zou, H., Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, 2005.
De Mol, C., De Vito, E., Rosasco, L. Elastic-net regularization in learning theory. Journal of Complexity, 2009.
[Figure: l1 norm vs. l2 norm on two variables x1, x2, with the models f(x)=β1⋅x1 and f(x)=β1⋅x1+β2⋅x2]
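The functional itself was an image on the slide and did not survive in the transcript. Following the elastic-net references cited above, and assuming the symbols used later in the talk (τ weighting the l1 term, μ as the correlation parameter weighting the l2 term), it can be written as:

```latex
\beta^{*} \;=\; \arg\min_{\beta \in \mathbb{R}^{d}} \;
  \underbrace{\frac{1}{n}\,\lVert Y - X\beta \rVert_2^2}_{\text{empirical risk}}
  \;+\; \tau\,\lVert \beta \rVert_1
  \;+\; \mu\,\lVert \beta \rVert_2^2
```

Here X is the n × d data matrix, Y the vector of n outputs, and β the coefficient vector of the linear model f(x) = β⋅x. The 1/n normalization of the risk is an assumption.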
• Consistency guaranteed (the more samples available, the better the estimator)
• Not univariate: takes into account the behavior of many genes at once
[Formula labels: regularization parameter (l1 term), correlation parameter (l2 term)]
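A minimal sketch of how such a mixed penalty can be minimized, assuming the objective (1/n)‖y − Xβ‖² + τ‖β‖₁ + μ‖β‖₂² and a plain iterative soft-thresholding scheme. This is an illustration in pure Python on synthetic data, not the l1l2py implementation; the function names, the data, and the values of τ and μ are all hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Proximal operator of the l1 norm for a single coefficient."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def l1l2_ista(X, y, tau, mu, n_iter=500):
    """Minimize (1/n)*||y - X b||^2 + tau*||b||_1 + mu*||b||_2^2 by ISTA."""
    n, d = len(X), len(X[0])
    # Step size 1/L. The squared Frobenius norm of X upper-bounds the largest
    # eigenvalue of X^T X, so L is a valid (conservative) Lipschitz constant
    # for the gradient of the smooth part (risk + l2 term).
    lipschitz = 2.0 * sum(v * v for row in X for v in row) / n + 2.0 * mu
    beta = [0.0] * d
    for _ in range(n_iter):
        residual = [yi - sum(b * v for b, v in zip(beta, row))
                    for row, yi in zip(X, y)]
        gradient = [-2.0 / n * sum(row[j] * r for row, r in zip(X, residual))
                    + 2.0 * mu * beta[j] for j in range(d)]
        # Gradient step on the smooth part, then soft-thresholding for l1.
        beta = [soft_threshold(beta[j] - gradient[j] / lipschitz,
                               tau / lipschitz) for j in range(d)]
    return beta


# Hypothetical data: 100 samples, 8 variables, only the first two matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

beta = l1l2_ista(X, y, tau=0.5, mu=0.01)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected variables:", selected)
print("coefficients:", [round(b, 3) for b in beta])
```

The l1 term is what sets coefficients exactly to zero; the surviving coefficients are shrunk, which is one reason a separate estimation step on the selected variables can be useful.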
l1l2 variable selection method
output: a one-parameter family of nested gene lists with equivalent prediction ability and increasing correlation among genes
• μ → 0: minimal list of prototype genes
• μ1 < μ2 < μ3 < ...: longer lists including correlated genes
by tuning the correlation parameter μ we can vary the list length
two stage approach (De Mol, Mosci, Traskine, Verri 2008)
variable selection step (l1l2):
classification step (rls):
for each value of the correlation parameter μ we have to choose λ and τ
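A sketch of the second stage only, assuming the first stage has already returned the indices of the selected variables and that "rls" stands for regularized least squares with objective (1/n)‖y − X_s β‖² + λ‖β‖², solved through its normal equations. The toy data, the `selected` list, and the function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def rls_on_selected(X, y, selected, lam):
    """RLS restricted to the selected columns:
    solves (Xs^T Xs + n*lam*I) beta = Xs^T y."""
    n, k = len(X), len(selected)
    Xs = [[row[j] for j in selected] for row in X]
    gram = [[sum(r[a] * r[b] for r in Xs) + (n * lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(r[a] * yi for r, yi in zip(Xs, y)) for a in range(k)]
    return solve_linear_system(gram, rhs)


# Toy data: labels are +1/-1 and depend only on the first two variables.
X = [[1, 0, 5], [0, 1, -3], [2, 1, 2], [1, 2, 0]]
y = [1, -1, 1, -1]
selected = [0, 1]          # assumed output of the variable selection step

beta = rls_on_selected(X, y, selected, lam=0.0)
predicted = [1 if sum(b * row[j] for b, j in zip(beta, selected)) > 0 else -1
             for row in X]
print("coefficients on selected variables:", beta)
print("predicted labels:", predicted)
```

With λ = 0 this is an ordinary least-squares refit on the selected variables; a positive λ shrinks the coefficients.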
Gene selection framework
Christine De Mol, Sofia Mosci, Magali Traskine, Alessandro Verri. A Regularized Method for Selecting Nested Groups of Relevant Genes from Microarray Data. Journal of Computational Biology, 2009.
statistical learning framework
the optimal pair (λ*, τ*) is one of the A⋅B possible pairs (λi, τj)
λ → (λ1, ..., λA)
τ → (τ1, ..., τB)
computational time in the LOO (leave-one-out) case, for one task:
time1-optim = 2.5 s to 25 s, depending on the correlation parameter
Total Time = A⋅B⋅N.samples⋅time1-optim ~ 20⋅20⋅30⋅time1-optim ~ 3⋅10^4 s to 3⋅10^5 s
A. Barla, S. Mosci, L. Rosasco, A. Verri. A method for robust variable selection with significance assessment. Proc. of ESANN, 2008.
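The cost estimate can be reproduced with a few lines. The grid values below are hypothetical placeholders; only the grid sizes (A = B = 20), the 30 samples and the 2.5 s to 25 s range per optimization come from the slide.

```python
from itertools import product

lambdas = [10 ** (-k / 4) for k in range(20)]   # hypothetical grid, A = 20
taus = [10 ** (-k / 4) for k in range(20)]      # hypothetical grid, B = 20
n_samples = 30                                  # leave-one-out: one fit per sample

pairs = list(product(lambdas, taus))            # all A*B candidate (lambda, tau) pairs
n_fits = len(pairs) * n_samples

for seconds_per_fit in (2.5, 25.0):
    total = n_fits * seconds_per_fit
    print(f"{seconds_per_fit:>4} s per fit: {total:.0f} s (about {total / 3600:.0f} h)")
```

With these round figures the product is 12,000 optimizations, i.e. roughly 3⋅10^4 s to 3⋅10^5 s (about 8 to 83 hours) for a single task, which motivates running the grid in parallel.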
l1l2 variable selection method: l1l2py and pplus
• l1l2py: a Python implementation of l1l2 variable selection method available at http://slipguru.disi.unige.it/Research/L1L2Py
• pplus: a Python wrapper for the Python library pp (should soon be available on our website)
Pathway Enrichment
functional characterization of the signature
• WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". The tool is available at: http://bioinfo.vanderbilt.edu/webgestalt/
• The analysis consists of performing a GSEA on Gene Ontology and/or KEGG, given the gene signature obtained by l1l2.
1. Zhang, B., Kirov, S.A., Snoddy, J.R. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res, 33(Web Server issue), W741-748. 2005.
2. Duncan, D.T., Prodduturi, N., Zhang, B. WebGestalt2: an updated and expanded version of the Web-based Gene Set Analysis Toolkit. BMC Bioinformatics, 11(Suppl 4):P10. 2010.
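To make the idea of enrichment concrete, here is a sketch of a hypergeometric over-representation test, a common way to score how surprising the overlap between a gene list and an annotated category is. That this matches the exact statistic configured in the tool for this analysis is an assumption, and the counts in the example are made up.

```python
from math import comb


def enrichment_p_value(n_universe, n_category, n_signature, n_overlap):
    """P(X >= n_overlap) when n_signature genes are drawn at random from a
    universe of n_universe genes, n_category of which belong to the category."""
    total = comb(n_universe, n_signature)
    upper = min(n_category, n_signature)
    tail = sum(comb(n_category, k) * comb(n_universe - n_category, n_signature - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical counts: 20000 genes on the array, 200 annotated to one GO node,
# a signature of 100 genes, 8 of which fall in that node (1 expected by chance).
p = enrichment_p_value(20000, 200, 100, 8)
print(f"p-value: {p:.2e}")
```

In practice one such test is run per GO node or KEGG pathway, so the resulting p-values need a multiple-testing correction before nodes are called enriched.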
AD and PD data: a case study
• AD: GSE9770 (early stage) and GSE5281 (late stage); platform: Affymetrix HG-U133 Plus 2.0
• PD: GSE6613 (early stage) and GSE20295 (late stage); platform: Affymetrix HG-U133A
Dataset     Stage      controls   cases
GSE9770     early AD   62         29
GSE5281     late AD    62         68
GSE6613     early PD   22         50
GSE20295    late PD    53         40
poster @ ADPD’11: "Uncovering candidate biomarkers for Alzheimer's and Parkinson's diseases with regularization methods and prior knowledge"
GSE9770: Winnie S. Liang et al. 2010. Neuronal gene expression in non-demented individuals with intermediate Alzheimer’s Disease neuropathology.
GSE5281: Winnie S. Liang et al. 2008. Alzheimer’s disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons.
GSE6613: Scherzer C.R. et al. 2006. Molecular markers of early Parkinson’s disease based on gene expression in blood.
GSE20295: Zhang et al. 2005. Transcriptional analysis of multiple brain regions in Parkinson’s Disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms.
Feature Selection Step (l1l2)
Results: l1l2 analysis
4 gene signatures specific for the early and late stages of both AD and PD
Dataset     Stage      #genes   accuracy (%)
GSE9770     early AD   73       62
GSE5281     late AD    90       80
GSE6613     early PD   132      90
GSE20295    late PD    106      95
Signature Comparison
[Figure: Venn diagrams comparing the Early PD, Late PD, Early AD and Late AD signatures. Gene labels shown: XIST, RPS4Y1, DEFA1/DEFA3, HLA-DQB1, HBB, PMS2L1/PMS2L2, SCAMP1, TAC1, SST, CD44, SYNCRIP, ACTR2, WNK1, UBE3A, MALAT1, RGS1, RGS4]
Signature Comparison
[Figure: Venn diagrams with panels Late PD / Early AD and Early PD / Late AD. Gene labels shown: HBB, OXR1, HBA1/HBA2, XIST, MOBP, MBP]
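Comparing signatures comes down to set intersections. The sketch below uses placeholder gene names and memberships (they are not the signatures from the talk, whose region-by-region content is not recoverable from the transcript) to show how pairwise and global overlaps can be listed.

```python
from itertools import combinations

# Hypothetical signatures: placeholder gene names, not the real lists.
signatures = {
    "early AD": {"GENE_A", "GENE_B", "GENE_C"},
    "late AD": {"GENE_A", "GENE_C", "GENE_D"},
    "early PD": {"GENE_A", "GENE_B", "GENE_E"},
    "late PD": {"GENE_A", "GENE_D", "GENE_F"},
}

# Genes shared by each pair of signatures.
pairwise = {
    (a, b): signatures[a] & signatures[b]
    for a, b in combinations(signatures, 2)
}
# Genes present in every signature.
common_to_all = set.intersection(*signatures.values())

for (a, b), shared in pairwise.items():
    print(f"{a} / {b}: {sorted(shared)}")
print("common to all four:", sorted(common_to_all))
```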
Pathway Enrichment (WebGestalt)
Functional Characterization (GSE9770): early AD
Functional Characterization (GSE5281): late AD
Alzheimer’s disease analysis
• The overlapping nodes are few, but some of them are quite specific: they concern the circulatory system (in particular blood circulation) and the neurological system processes linked to the "transmission of nerve impulse" node.
• The nodes specific to the early stage concern:
  • the nervous system,
  • the blood
• The nodes specific to the late stage are mostly related to:
  • negative regulation of several processes (e.g. of cell proliferation, of cell communication, of macromolecule biosynthetic process),
  • neurological system processes (connected to the last enriched node, named visual perception),
  • behavior,
  • the response to stimuli (behavior, drug, hormone)
Functional Characterization (GSE6613): early PD
Functional Characterization (GSE20295): late PD
Parkinson’s disease analysis
• The overlapping nodes are few and have a very general meaning (e.g. intracellular, cytoplasm, negative regulation of biological process).
• The nodes specific to the early stage concern:
  • the immune system,
  • the response to stimulus (i.e. stress, chemicals or other organisms such as viruses),
  • the regulation of metabolic processes, biological quality and cell death.
• The nodes specific to the late stage are related to:
  • the nervous system (e.g. neurotransmitter transport, transmission of nerve impulse, learning or memory),
  • the response to stimuli (e.g. behavior, temperature, organic substances, drugs or endogenous stimuli)
Prior Knowledge Integration
Function-based analysis via l1-l2 regularization
• combine the selection protocol with the GO structure
• provide a way to easily interpret the output of the feature selection protocol
joint work with University of Padova
poster @ ECCB’09: "Function-based analysis of microarray data via l1-l2 regularization"
[Figure: Gene Ontology + expression data → classification via l1-l2 regularization (SVS python library)]
Function-based analysis via l1-l2 regularization: PD data
[Figure: panels "common", "PD late", "PD early"]
Function-based analysis via l1-l2 regularization: AD data
[Figure: panels "common", "AD early", "AD late"]
Gene Network Analysis
Gene Network Analysis (preliminary results)
• Since most known diseases are systemic in nature, we tackled this systems biology problem from the perspective of network medicine.
• The aim is to understand the molecular pathways (from transcription to signaling inside the cells).
• To do so, we estimate these pathways using algorithms that infer network topology from high-throughput measurements.
joint work with FBK, Trento: "A machine learning pipeline for discriminant analysis and pathways identification", submitted to CIBB2011 and Bioinformatics
Gene Network Analysis (preliminary results)
• machine learning based pipeline for evaluating the disruption of important molecular pathways
• input: microarray measurements (case vs control)
• Pipeline:
[Figure: Gene Expression Data + Phenotype → Classification/Regression with Feature selection step (SRDA, L1L2) → Pathway Enrichment (GSEA, GSA) → Subnetwork Inference (Aracne, WGCNA) → Subnetwork Analysis (Density) → Subnetwork Comparison]
Gene Network Analysis
• For each analyzed dataset we identify/build:
  • a gene signature (l1l2)
  • the most enriched pathways in the signature (WebGestalt/GSEA)
  • the network for each selected pathway (considering all the genes involved in the pathway), which we then compare between cases and controls.
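A rough sketch of the last two steps, assuming a WGCNA-style weighted co-expression network (adjacency = |Pearson correlation| raised to a power) and network density as the quantity compared between cases and controls. The gene names, the simulated expression values, the power of 6, and the choice of which group is "disrupted" are all hypothetical.

```python
import random
from itertools import combinations


def pearson(u, v):
    """Pearson correlation between two equally long lists of values."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def network_density(expression, power=6):
    """Mean edge weight of the network with weights |cor(gene_i, gene_j)|**power."""
    weights = [abs(pearson(expression[a], expression[b])) ** power
               for a, b in combinations(sorted(expression), 2)]
    return sum(weights) / len(weights)


def simulate(genes, n_samples, coupled, rng):
    """Hypothetical pathway: genes follow a common factor only if coupled."""
    data = {g: [] for g in genes}
    for _ in range(n_samples):
        factor = rng.gauss(0, 1)
        for g in genes:
            base = factor if coupled else rng.gauss(0, 1)
            data[g].append(base + rng.gauss(0, 0.2))
    return data


rng = random.Random(0)
pathway_genes = ["g1", "g2", "g3", "g4"]
controls = simulate(pathway_genes, 30, coupled=True, rng=rng)
cases = simulate(pathway_genes, 30, coupled=False, rng=rng)

density_controls = network_density(controls)
density_cases = network_density(cases)
print(f"density in controls: {density_controls:.3f}")
print(f"density in cases:    {density_cases:.3f}")
```

Ranking pathways by the size of such a difference is one simple way to obtain an ordering of the most disrupted pathways.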
Gene Network Analysis (AD early)
• The analysis allows for a ranking of the most disrupted pathways.
• Here we present the GO:0019787 node (small conjugating protein ligase activity)
[Figure: networks for cases and controls]
Gene Network Analysis (PD)
• The analysis allows for a ranking of the most disrupted pathways.
• Here we present the GO:0045087 node (innate immune response)
[Figure: networks for cases and controls]
ongoing projects
project               collaboration    publications
AD                    --               BMC Med Gen*
PD                    --               poster ECCB 2010
AD/PD comparison      --               poster ADPD2011
biomarker stability   DEI UniPD/FBK    Bioinformatics*
Breast+prior GO       DEI UniPD        --
network analysis      FBK              CIBB2011*/Bioinformatics*
CGH neuroblastoma     IST              --
CGH/SKY cell lines    DIMI/NIH         --
JIA subtyping         IGG+SIGMA        poster ISMB/ECCB 2011