DopazoBioinfoCIPF.ppt
Bioinformatics
Department of Bioinformatics and Genomics (BIG), Centro de Investigación Príncipe Felipe (CIPF), and
Functional Genomics Node (INB), Valencia, Spain.
http://www.gepas.org
http://www.babelomics.org
http://bioinfo.cipf.es
The pre-genomics paradigm: from genotype to phenotype
Genes in the DNA...
>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
…code for proteins...
…whose structure accounts for function...
…plus the environment...
…produces the final phenotype.
From genotype to phenotype (in the functional post-genomics scenario)
Genes in the DNA...
>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
…which can be different because of the variability (15 million SNPs)...
Next Generation Sequencing: 10^9 bp per round
…when expressed in the proper moment and place (a typical tissue expresses between 5000 and 10000 genes)...
…code for proteins (100K-500K proteins) that undergo post-translational modifications, somatic recombination...
…whose structures account for function...
…in cooperation with other proteins (each protein has an average of 8 interactions)…
…conforming complex interaction networks...
…whose final effect configures the phenotype.
GEPAS and Babelomics: overview of the tools
Raw data
Preprocessor / Normalization: Affymetrix, Agilent, Two-colour arrays
Differential expression: T-Rex (two classes, multi classes, correlation, survival)
Clustering: Hierarchical, SOM, SOTA, K-means
Class prediction: Prophet (KNN, DLDA, SVM, Random forest)
Arrays-CGH: ISACGH
RIDGE analysis
Functional annotation: Babelomics
Functional enrichment: FatiGO, FatiGO+, Marmite, TMT, CAAT
Gene set enrichment: FatiScan, GSEA
Herrero et al., 2003, 2004; Vaquerizas et al., 2005 NAR; Montaner et al., 2006 NAR; Al-Shahrour et al., 2005, 2006, 2007 NAR; 2005 Bioinformatics, 2007 BMC Bioinformatics; Tarraga et al., NAR 2008; Al-Shahrour et al., NAR 2008
BLAST2GO: Automatic functional annotation
Clustering of experiments: distinctive gene expression patterns in human mammary epithelial cells and breast cancers
Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells.
Symbolic representation
Gene selection.
The simplest way: univariate, gene-by-gene. Other multivariate approaches can be used.
• Two classes: T-test, Bayes, Data-adaptive, Clear
• Multiclass: Anova, Clear
• Continuous variable (e.g. level of a metabolite): Pearson, Spearman, Regression
• Survival: Cox model
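As an illustration of the univariate, gene-by-gene approach for two classes, the following minimal sketch computes a Welch t statistic for each gene and ranks the genes by its absolute value. The function names, the toy expression values and the choice of the Welch variant are assumptions made for this example; this is not the T-rex implementation.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic comparing two groups of expression values."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)


def rank_genes(expr, labels):
    """Score every gene independently and sort by |t|, largest first.

    expr   -- dict: gene name -> list of expression values (one per sample)
    labels -- list with the class (0 or 1) of each sample
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = welch_t(class1, class0)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data (hypothetical): three samples per class, two genes.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clearly higher in class 1
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no difference between classes
}
ranking = rank_genes(expr, labels)
print(ranking)
```

In a real analysis each statistic would be converted into a p-value and corrected for multiple testing, since thousands of genes are tested at once.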
The T-rex tool
Genes differentially expressed between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC)
[Figure: expression of the selected genes in NE and EEC samples; table columns: G Symbol, A Number]
86 genes with significantly different expression patterns between Normal Endometrium and Endometrioid Endometrial Carcinoma (FDR adjusted p<0.05) selected among the ~7000 genes in the CNIO oncochip
Moreno et al., 2003 Cancer Research 63, 5697-5702
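The gene list above was selected with an FDR-adjusted threshold (p<0.05). The slide does not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg adjustment, a common choice, and applies it to made-up p-values. The function name and the data are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)
selected = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, selected)
```

Note that two genes with raw p-values below 0.05 (0.039 and 0.041) are no longer selected after the adjustment.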
The MicroArray Quality Control (MAQC) Project: An FDA-Led Effort Toward Personalized Medicine
MAQC Website: http://edkb.fda.gov/MAQC/
MAQC-II Objective: Reaching consensus on the “best practices” (Data Analysis Protocol, DAP) in developing and validating microarray-based predictive models (classifiers) for clinical and preclinical applications.
An international consortium of 36 data analysis teams submitted prediction results from 18,202 models for 6 datasets to the MAQC-II.
Studying copy number alterations
Correlation CNA – expression.
[Figure labels: amplification; deletion; minimum region with gains and losses; zoom of the region]
Understanding why genes differ in their expression between two different conditions
Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL).
Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip
Genes differentially expressed between both groups were mainly related to immune response (activated in mature lymphocytes)
Martinez et al., Clinical Cancer Research. 10: 4971-4982.
Biological Databases
Genome Annotation, Gene Annotation, Gene Set Annotation
Functional Annotation: Gene Ontology (Biological Process, Molecular Function, Cellular Component), KEGG pathways, Biocarta pathways, Reactome, Swissprot keywords
Structural Annotation: protein structure, motifs, domains
Protein-Protein interactions
Bioentities from literature
Gene Expression
Modules
Regulatory elements: miRNA, CisRed, Transcription Factor Binding Sites
mSigDB
Case study: functional differences in a class comparison experiment
A: 8 with impaired tolerance (IGT) + 18 with type 2 diabetes mellitus (DM2)
B: 17 with normal tolerance to glucose (NTG)
No single gene shows significant differential expression upon the application of a t-test.
Nevertheless, many pathways and functional blocks are significantly activated/deactivated.
Healthy vs diabetic
Repository columns: Functional class, GO, KEGG, Swissprot keyword
Up-regulated:
Oxidative phosphorylation X X
ATP synthesis X
Ribosome X
Ubiquinone X
Ribosomal protein X
Ribonucleoprotein X
Mitochondrion X X
Transit peptide X
Nucleotide biosynthesis X
NADH dehydrogenase (ubiquinone) activity X
Nuclease activity X
Down-regulated:
Insulin signalling pathway X
(Mootha et al., 2003)
Beyond discrete variables: Survival data
Since FatiScan depends only on a list of ordered genes, and not on the original experimental values, it can be applied to different experimental designs.
Microarrays: 34 samples from tumours of hypopharyngeal cancer (GEO GDS1070)
Gene selection: Cox Proportional-Hazards model to study how the expression of each gene across patients is related to their survival
Gene     risk     (list ordered from - Survival to + Survival)
Gen1     5.8
Gen2     5.6
Gen3     5.4
Gen4     5.2
Gen5     5.2
Gen6     5.0
……       ….
Gen1000  -6.0
Gen1001  -6.3
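Because only the order of the genes matters, a functional term can be tested directly on the ranked list, whatever statistic produced the ranking. The sketch below is a simplified illustration in that spirit, not the FatiScan code: it cuts the ranked list at a few points and, for each cut, computes a one-sided hypergeometric probability of seeing at least that many genes of the term above the cut. The function names, the number of partitions and the toy gene names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, K of which carry the term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total


def scan_ranked_list(ranked_genes, term_genes, n_partitions=4):
    """Test a term for enrichment at the top of a ranked gene list.

    Returns one (cut size, term genes above the cut, p-value) tuple
    for each internal cut point.
    """
    N = len(ranked_genes)
    K = sum(g in term_genes for g in ranked_genes)
    results = []
    for p in range(1, n_partitions):
        n = N * p // n_partitions  # genes above this cut
        k = sum(g in term_genes for g in ranked_genes[:n])
        results.append((n, k, hypergeom_upper_tail(k, n, K, N)))
    return results


# Hypothetical ranked list of 20 genes (e.g. ordered by risk) and a
# term annotated to four genes that sit near the top of the list.
ranked = [f"g{i}" for i in range(1, 21)]
term = {"g1", "g2", "g3", "g5"}
scan = scan_ranked_list(ranked, term)
for cut, hits, pval in scan:
    print(cut, hits, round(pval, 5))
```

With many terms and several cuts per term, the resulting p-values would need a multiple-testing correction before interpretation.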
Evaluating the cooperative behaviour of the list
Methodologies – protein interaction networks
Generation of the “functional module” (GO terms, KEGG pathways or bioentities)
Minimum Connection Network (functional module)
We find the shortest paths between all pairs of nodes in the list. We accept the paths that connect two nodes either directly or through a given number of nodes not included in the list.
[Figure labels: list of selected proteins; shortest paths; Minimum Connection Network (MCN); nodes included in the list; nodes not included in the list; Prot 1; Prot 2]
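The shortest-path procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the interaction network is an undirected graph stored as an adjacency dictionary, only one shortest path per pair is considered, and the names `shortest_path`, `minimum_connection_network` and `max_external` are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def minimum_connection_network(graph, selected, max_external=1):
    """Collect the edges of accepted shortest paths between list proteins.

    A path is accepted when it uses at most `max_external` intermediate
    nodes that are not in the selected list.
    """
    edges = set()
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        external = [n for n in path[1:-1] if n not in selected]
        if len(external) <= max_external:
            edges.update(frozenset(e) for e in zip(path, path[1:]))
    return edges


# Toy interaction network: P* are proteins in the list, X* are not.
graph = {
    "P1": ["P2", "X1"],
    "P2": ["P1"],
    "X1": ["P1", "P3"],
    "P3": ["X1", "X2"],
    "X2": ["P3", "X3"],
    "X3": ["X2", "P4"],
    "P4": ["X3"],
}
selected = {"P1", "P2", "P3", "P4"}
mcn = minimum_connection_network(graph, selected, max_external=1)
print(sorted(tuple(sorted(e)) for e in mcn))
```

In this toy network P4 stays outside the module because every path to it needs more than one protein that is not in the list.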
Variation of physical connections in biochemical pathways (normal vs cancer)
Cellular Processes (connection gains in cancer): Prostate, Mammary Gland
1 - auto-connections in Cell cycle
2 - Cell cycle - Tight junction
3 - Gap junction - Insulin signaling pathway
4 - Gap junction - Fc epsilon RI signaling pathway
5 - Toll like receptor signaling pathway auto-connections
6 - Toll like receptor signaling pathway - B cell receptor signaling
7 - Insulin signaling pathway - Melanogenesis
The babelomics suite for functional profiling of genomic experiments
Biological information from:
• GO
• Interpro motifs
• KEGG pathways
• Biocarta pathways
• Swissprot keywords
• TFBSs (Transfac)
• Regulatory motifs (CisRED)
• miRNAs
• Protein interactions
• Tissues
• Text-mining
• Chromosomal location
For human, mouse, rat, chicken, cow, fly, worm, yeast, A. thaliana and bacteria
Tests for:
• functional enrichment
• gene set enrichment
http://www.babelomics.org
Al-Shahrour et al., 2005, 2006, 2007, 2008 NAR; 2004, 2005 Bioinformatics, 2007 BMC Bioinformatics
Expanding the concept of functional profiling
• Better functional annotations will help
• Testing models will increase our sensitivity
• Functions and pathways are correlated (higher levels of organization).
In general, (systems) biology is behind. Our questions must be inspired directly by biology.
Successful reception by the scientific community
GEPAS: currently the most cited web-based platform for transcriptomic analysis (482 scholar google citations)
Babelomics: third most cited platform (575 scholar google citations; FatiGO is amongst the 50 most cited papers in Bioinformatics)
Approximately 1000 users per day
1500 registered users (6 months)
Publications: 2008 – 6, 2007 – 6, 2006 – 5, 2005 – 5
Microarray data analysis web tools with at least 10 citations (1)
Web tool            URL                                            Citations (1)
GEPAS               http://www.gepas.org                           482
ExpressionProfiler  http://www.ebi.ac.uk/expressionprofiler        52
caGEDA              http://bioinformatics.upmc.edu/GEDA.html       36
GenePublisher       http://www.cbs.dtu.dk/services/GenePublisher   25
ExpressYourself     http://bioinfo.mbb.yale.edu/expressyourself    26
RACE                http://race.unil.ch/                           22
ArrayPipe           http://www.pathogenomics.ca/arraypipe          20
VAMPIRE             http://genome.ucsd.edu/microarray/             19
MIDAW               http://muscle.cribi.unipd.it/midaw/            15
t-profiler          http://www.t-profiler.org                      16
CARMAweb            https://carmaweb.genome.tugraz.at              12
(1) Scholar Google citations over all the references of the tool (June 2008).
Functional Genomics: SNP analysis
PupaSuite
Interactive selection of optimal sets of SNPs for large-scale genotyping
SNPeffect database. Phenotyping of human SNPs and disease mutations
Genome projects. Design and implementation of:
Workflow for high-throughput genotyping at CeGen.
Problem 1: feed the monster. E.g. Illumina: 150.000 genotypes at a time
Computer-aided selection. PupaSuite (Conde et al. 2004, 2005, 2007 NAR)
Experimental design (linkage, pathway, etc)
Problem 2: store results...
Cancer SNPs DB server
Problem 3: query the database...
...along with clinical data...
...and submit to analysis programs: LD, case-control, haplotypes, odds ratios, etc.
October 2004: 45.000 SNPs designed
Functional Genomics: siRNA analysis
SiDE (Si-RNA DEsign)
Highly specific and accurate selection of siRNAs for high-throughput functional assays
http://bioinfo.cipf.es
Next generation sequencing: throughput up to 10^9 bp/day
Illumina Genome Analyzer (Solexa)
Genome Sequencer FLX System “454”
SOLiD. Applied Biosystems
Re-sequencing, de novo sequencing, CNVs, SNPs, transcriptomics, ChIP-on-chip, etc.
Next Generation Sequencing
For:
• Transcriptomics
• Resequencing
• SNPs
• Copy number
• ChIP-on-chip like
Technological services
Sequencing Facility
Applied Biosystems 3730XL DNA analyzer
1.6 Mbp per day
Sequencing: plasmid, cosmid, BAC ends
SNPs, microsatellite, methylation profiles, fragment analysis
Microarray Facility
Labelling, Hybridization, Scan
Commercial arrays: Agilent, Operon, Eppendorf, Clontech, GE Healthcare
DNA microarrays: ChIP on chip, exon arrays, microRNAs, BAC arrays
Array design: probe selection
Computing Power
Computing cluster: 230 CPUs
20 servers x 2 Xeon quad-core
20 servers x 2 Opteron processors
30 PCs x 1 Athlon processor
20 TB disk storage
x86_64 architecture
0.5 M Blast runs against nr in 24 hours
http://bioinfo.cipf.es
The bioinformatics department at the Centro de Investigación Príncipe Felipe (Valencia, Spain)...
...the INB, National Institute of Bioinformatics (Functional Genomics Node) and the CIBER-ER Network of Centers for Rare Diseases
Joaquín Dopazo
Eva Alloza
Leonardo Arbiza
Fátima Al-Shahrour
Emidio Capriotti
Jose Carbonell
Ana Conesa
Hernán Dopazo
Pablo Escobar
Francisco García
Stefan Goetz
Jaime Huerta
Rafael Jimenez
Marc Martí
Ignacio Medina
Pablo Minguez
David Montaner
François Serra
Joaquín Tárraga