DopazoBioinfoCIPF.ppt [Compatibility mode] · Overview of the combined in vitro and breast...

Bioinformatics

Department of Bioinformatics and Genomics (BIG), Centro de Investigación Príncipe Felipe (CIPF), and Functional Genomics Node (INB), Valencia, Spain.

http://www.gepas.org
http://www.babelomics.org
http://bioinfo.cipf.es


The pre-genomics paradigm: from genotype to phenotype.

Genes in the DNA...

>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

…code for proteins...

…whose structure accounts for function...

…plus the environment...

…produces the final phenotype.

From genotype to phenotype (in the functional post-genomics scenario).

Genes in the DNA...

>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

…which can be different because of the variability (15 million SNPs)...

Next Generation Sequencing: 10^9 bp per round

…when expressed in the proper moment and place (a typical tissue expresses between 5000 and 10000 genes)...

…code for proteins (100K-500K proteins)...

…that undergo post-translational modifications, somatic recombination...

…whose structures account for function...

…in cooperation with other proteins (each protein has an average of 8 interactions)…

…conforming complex interaction networks...

…whose final effect configures the phenotype.

The future that is coming

GEPAS and Babelomics (overview diagram)

Preprocessor: raw data from Affymetrix, Agilent and two-colour arrays
Normalization
Differential expression (T-Rex): two classes, multi classes, correlation, survival
Clustering: Hierarchical, SOM, SOTA, K-means
Class prediction (Prophet): KNN, DLDA, SVM, Random forest
Arrays-CGH: ISACGH
RIDGE analysis
Babelomics (functional annotation, functional enrichment, gene set enrichment): FatiGO, FatiGO+, FatiScan, GSEA, Marmite, TMT, CAAT

Herrero et al., 2003, 2004; Vaquerizas et al., 2005 NAR; Montaner et al., 2006 NAR; Al-Shahrour et al., 2005, 2006, 2007 NAR; 2005 Bioinformatics, 2007 BMC Bioinformatics; Tarraga et al., NAR 2008; Al-Shahrour et al., NAR 2008

BLAST2GO: Automatic functional annotation

Clustering of experiments: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers

Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells.

Symbolic representation

Gene selection. The simplest way: univariate, gene-by-gene. Other multivariate approaches can be used.

• Two classes: T-test, Bayes, Data-adaptive, Clear
• Multiclass: Anova, Clear
• Continuous variable (e.g. level of a metabolite): Pearson, Spearman, Regression
• Survival: Cox model
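The gene-by-gene approach for the two-class case can be illustrated with a minimal sketch. It computes Welch's t statistic for each gene independently and ranks genes by absolute value. The function names, the data layout and the toy expression values are assumptions made for illustration; this is not the T-Rex implementation, and a real analysis would also derive p-values and correct them for multiple testing.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels):
    """Score every gene on its own and sort by decreasing |t|.

    expr   -- dict mapping gene name to a list of values, one per sample
    labels -- list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = welch_t(a, b)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical toy data: 3 samples of class 0 followed by 3 of class 1.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clearly different between classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
}
ranking = rank_genes(expr, labels)
```

Here `ranking` lists `geneA` first, with a large negative statistic because its class 0 mean is lower than its class 1 mean.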

The T-Rex tool

Genes differentially expressed between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC)

[Figure: NE vs. EEC samples; table columns G Symbol and A Number]

86 genes with significantly different expression patterns between Normal Endometrium and Endometrioid Endometrial Carcinoma (FDR-adjusted p<0.05), selected among the ~7000 genes in the CNIO oncochip

Moreno et al., 2003 Cancer Research 63, 5697-5702
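The slide reports an FDR-adjusted threshold without naming the adjustment method. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, a common choice for microarray studies; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a gene-by-gene test.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(pvals)
selected = [i for i, q in enumerate(adj) if q < 0.05]
```

With these values only the first two genes pass the 0.05 threshold after adjustment, although four of the raw p-values are below 0.05.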

Prognostic and diagnostic predictors

The PROPHET and the MAQC-II Initiative
Medina 2007 Bioinformatics

The MicroArray Quality Control (MAQC) Project: An FDA-Led Effort Toward Personalized Medicine

MAQC Website: http://edkb.fda.gov/MAQC/
MAQC-II Objective: Reaching consensus on the “best practices” (Data Analysis Protocol, DAP) in developing and validating microarray-based predictive models (classifiers) for clinical and preclinical applications.

An international consortium of 36 data analysis teams submitted prediction results from 18,202 models for 6 datasets to the MAQC-II.
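Developing and validating a classifier can be sketched with a k-nearest-neighbour predictor (KNN is one of the methods listed for Prophet) evaluated by leave-one-out cross-validation. This is an illustrative assumption about the kind of procedure involved, not the Prophet code or the MAQC-II protocol; the function names and the two-gene toy samples are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, sample, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: dist(xy[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(samples, labels, k=3):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        correct += knn_predict(rest_x, rest_y, x, k) == y
    return correct / len(samples)


# Hypothetical expression of two genes in six samples.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (3.0, 3.1), (2.9, 3.3), (3.2, 2.8)]
labels = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]
accuracy = leave_one_out_accuracy(samples, labels)
```

If genes are selected before prediction, the selection step should be repeated inside each cross-validation round; selecting genes on the full dataset first tends to give optimistic accuracy estimates.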

Studying copy number alterations

Correlation CNA – expression.

[Figure labels: amplification; deletion; minimum region with gains and losses; zoom of the region]

Understanding why genes differ in their expression between two different conditions

[Figure: PTL vs. LBC samples]

Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL).

Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip.

Genes differentially expressed between both groups were mainly related to immune response (activated in mature lymphocytes).

Martinez et al., Clinical Cancer Research. 10: 4971-4982.

Gene Annotation and Gene Set Annotation (diagram of information sources)

• Genome Annotation
• Functional Annotation: Gene Ontology (Biological Process, Molecular Function, Cellular Component)
• KEGG pathways, Biocarta pathways, Reactome
• Protein-Protein interactions
• Structural Annotation: Protein Structure, Motifs, Domains
• Keywords Swissprot
• Biological Databases
• Bioentities from literature
• Gene Expression
• Modules
• Regulatory elements: miRNA, CisRed, Transcription Factor Binding Sites
• mSigDB

Case study: functional differences in a class comparison experiment

A: 8 with impaired tolerance (IGT) + 18 with type 2 diabetes mellitus (DM2)
B: 17 with normal tolerance to glucose (NTG)

No single gene shows significant differential expression upon the application of a t-test.

Nevertheless, many pathways and functional blocks are significantly activated/deactivated (Mootha et al., 2003).

Healthy vs diabetic | Functional class | Repository (GO, KEGG, Swissprot keyword)

Up-regulated:
Oxidative phosphorylation                 X X
ATP synthesis                             X
Ribosome                                  X
Ubiquinone                                X
Ribosomal protein                         X
Ribonucleoprotein                         X
Mitochondrion                             X X
Transit peptide                           X
Nucleotide biosynthesis                   X
NADH dehydrogenase (ubiquinone) activity  X
Nuclease activity                         X

Down-regulated:
Insulin signalling pathway                X

Beyond discrete variables: Survival data

Since FatiScan depends only on a list of ordered genes, and not on the original experimental values, it can be applied to different experimental designs.

Microarrays: 34 samples from tumours of hypopharyngeal cancer (GEO GDS1070)

Gene selection: Cox Proportional-Hazards model to study how the expression of each gene across patients is related to their survival

Gen       risk
Gen1      5.8     (- Survival)
Gen2      5.6
Gen3      5.4
Gen4      5.2
Gen5      5.2
Gen6      5.0
……        ….
Gen1000   -6.0
Gen1001   -6.3    (+ Survival)
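Because only the ordering of genes matters, a test on a ranked list can be sketched without any expression values. The example below cuts a ranked list at several positions and, for each cut, uses a one-sided hypergeometric test to ask whether genes annotated with a functional term are over-represented in the upper part. This is a simplified illustration under stated assumptions, not the FatiScan algorithm: the function names, the number of partitions, the gene names and the annotation set are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total


def scan_partitions(ranked_genes, annotated, n_partitions=4):
    """Test over-representation of annotated genes above each cut of the list.

    Returns a list of (cut position, annotated genes above the cut, p-value).
    """
    N = len(ranked_genes)
    K = sum(1 for g in ranked_genes if g in annotated)
    results = []
    for step in range(1, n_partitions):
        cut = N * step // n_partitions
        hits = sum(1 for g in ranked_genes[:cut] if g in annotated)
        results.append((cut, hits, hypergeom_upper_tail(hits, cut, K, N)))
    return results


# Hypothetical list of 20 genes, already ordered by decreasing risk.
ranked = [f"Gen{i}" for i in range(1, 21)]
term_genes = {"Gen1", "Gen2", "Gen3", "Gen5", "Gen12"}  # genes sharing one term
results = scan_partitions(ranked, term_genes)
```

In this toy case the strongest signal is at the first cut, where four of the five annotated genes fall among the top five positions.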

Evaluating the cooperative behaviour of the list

Methodologies – protein-protein interaction networks

Generation of the “functional module” (GO terms, KEGG pathways or bioentities)

Minimal Connection Network (functional module)

We find the shortest paths between all pairs of nodes in the list. We accept the paths that connect two nodes either directly or through a set number of nodes not included in the list.
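The construction described above can be sketched with a breadth-first search over an interaction graph. The adjacency-dictionary representation, the function names, the `max_external` parameter and the toy network are assumptions made for illustration; when several shortest paths exist between two proteins, this sketch keeps only the first one found.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def minimal_connection_network(graph, selected, max_external=1):
    """Collect the edges of accepted shortest paths between listed proteins.

    A path is accepted when at most max_external of its intermediate
    nodes are proteins that are not in the selected list.
    """
    selected = set(selected)
    edges = set()
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        external = [n for n in path[1:-1] if n not in selected]
        if len(external) <= max_external:
            edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges


# Hypothetical undirected interaction network; P1..P4 are the listed proteins.
graph = {
    "P1": ["X"], "X": ["P1", "P2"], "P2": ["X", "P3"],
    "P3": ["P2", "Y"], "Y": ["P3", "Z"], "Z": ["Y", "P4"], "P4": ["Z"],
}
mcn = minimal_connection_network(graph, ["P1", "P2", "P3", "P4"], max_external=1)
```

In this toy network P1, P2 and P3 end up connected through the unlisted node X, while P4 is left out because reaching it needs two unlisted intermediates (Y and Z).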

[Figure labels: list of selected proteins; short paths; MCN; nodes included in the list; nodes not included in the list; Prot 1; Prot 2]

Variation of physical connections in biochemical pathways (normal vs cancer)

Cellular Processes (connection gains in cancer): Prostate, Mammary Gland

1 - auto-connections in Cell cycle
2 - Cell cycle - Tight junction
3 - Gap junction - Insulin signaling pathway
4 - Gap junction - Fc epsilon RI signaling pathway
5 - Toll like receptor signaling pathway auto-connections
6 - Toll like receptor signaling pathway - B cell receptor signaling
7 - Insulin signaling pathway - Melanogenesis

The babelomics suite for functional profiling of genomic experiments

Biological information from:
• GO
• Interpro motifs
• KEGG pathways
• Biocarta pathways
• Swissprot keywords
• TFBSs (Transfac)
• Regulatory motifs (CisRED)
• miRNAs
• Protein interactions
• Tissues
• Text-mining
• Chromosomal location

For human, mouse, rat, chicken, cow, fly, worm, yeast, A. thaliana and bacteria

Tests for:
• functional enrichment
• gene set enrichment

http://www.babelomics.org

Al-Shahrour et al., 2005, 2006, 2007, 2008 NAR; 2004, 2005 Bioinformatics; 2007 BMC Bioinformatics

Expanding the concept of functional profiling

• Better functional annotations will help
• Testing models will increase our sensitivity
• Functions and pathways are correlated (higher levels of organization).

In general, (systems) biology is behind. Our questions must be inspired directly by biology.

Successful reception by the scientific community

GEPAS: currently the most cited web-based platform for transcriptomic analysis (482 scholar google citations)
Babelomics: third most cited platform (575 scholar google citations; FatiGO is amongst the 50 most cited papers in Bioinformatics)

Approximately 1000 users per day
1500 registered users (6 months)

Publications
2008 – 6
2007 – 6
2006 – 5
2005 – 5

Microarray data analysis webtools with at least 10 citations (1)

Web tool            URL                                             Citations (1)
GEPAS               http://www.gepas.org                            482
ExpressionProfiler  http://www.ebi.ac.uk/expressionprofiler         52
caGEDA              http://bioinformatics.upmc.edu/GEDA.html        36
GenePublisher       http://www.cbs.dtu.dk/services/GenePublisher    25
ExpressYourself     http://bioinfo.mbb.yale.edu/expressyourself     26
RACE                http://race.unil.ch/                            22
ArrayPipe           http://www.pathogenomics.ca/arraypipe           20
VAMPIRE             http://genome.ucsd.edu/microarray/              19
MIDAW               http://muscle.cribi.unipd.it/midaw/             15
t-profiler          http://www.t-profiler.org                       16
CARMAweb            https://carmaweb.genome.tugraz.at               12

(1) Scholar Google citations over all the references of the tool (June 2008).

Functional Genomics: SNP analysis

PupaSuite: Interactive selection of optimal sets of SNPs for large-scale genotyping

SNPeffect database: Phenotyping of human SNPs and disease mutations

Genome projects. Design and implementation of: Workflow for high-throughput genotyping at CeGen.

Problem 1: feed the monster. E.g. Illumina: 150,000 genotypes at a time
Computer-aided selection. PupaSuite (Conde et al. 2004, 2005, 2007 NAR)
Experimental design (linkage, pathway, etc)

Problem 2: store results...
Cancer SNPs DB server
...along with clinical data

Problem 3: query the database...
...and submit to analysis programs: LD, case-control, haplotypes, odds ratios, etc.

October 2004: 45,000 SNPs designed

Functional Genomics: siRNA analysis

Si-RNA DEsign (SiDE): Highly specific and accurate selection of siRNAs for high-throughput functional assays

http://bioinfo.cipf.es

Next generation sequencing: throughput up to 10^9 bp/day

Illumina Genome Analyzer (Solexa)

Genome Sequencer FLX System “454”

SOLiD. Applied Biosystems

Re-sequencing, de novo sequencing, CNVs, SNPs, transcriptomics, ChIP-on-chip, etc.

For:
• Transcriptomics
• Resequencing
• SNPs
• Copy number
• ChIP-on-chip like

Technological services

Sequencing Facility
• Applied Biosystems 3730XL DNA analyzer
• 1.6 Mbp per day
• Sequencing: plasmid, cosmid, BAC ends
• SNPs, microsatellite, methylation profiles, fragment analysis

Microarray Facility
• Labelling, Hybridization, Scan
• Commercial arrays: Agilent, Operon, Eppendorf, Clontech, GE Healthcare
• DNA microarrays: ChIP on Chip, exon arrays, microRNAs, BAC arrays
• Array design: probe selection

Computing Power
• Computing cluster, 230 CPUs
• 20 servers x 2 Xeon quad-core
• 20 servers x 2 Opteron processors
• 30 PCs x 1 Athlon processor
• 20 TB disk storage
• x86_64 architecture
• 0.5 M Blast runs against nr in 24 hours

http://bioinfo.cipf.es

The bioinformatics department at the Centro de Investigación Príncipe Felipe (Valencia, Spain)...

...the INB, National Institute of Bioinformatics (Functional Genomics Node) and the CIBER-ER Network of Centers for Rare Diseases

Joaquín Dopazo
Eva Alloza
Leonardo Arbiza
Fátima Al-Shahrour
Emidio Capriotti
Jose Carbonell
Ana Conesa
Hernán Dopazo
Pablo Escobar
Francisco García
Stefan Goetz
Jaime Huerta
Rafael Jimenez
Marc Martí
Ignacio Medina
Pablo Minguez
David Montaner
François Serra
Joaquín Tárraga