Mining and Pattern Analysis in Large
Data Sets for Biological Information.
David W. Mount
Arizona Cancer Center
• Analysis of gene expression microarray data
sets with goal of preventing or curing cancer
– Statistical analysis of data
– Using biological information to interpret
data
• Future types of genetic analyses
My major objectives.
Develop hypotheses based on data analysis that can be tested in the laboratory or clinic
Use and develop new methods for data analysis - pattern analysis, clustering, data mining, biological models
Focus 1: early changes in colorectal and prostate cancer
Focus 2: drugs for pancreatic cancer
Major goal: to discover the unusual based on statistical and biological data analyses
Using data from two types of microarrays for measuring
gene expression of ~35,000 human genes.
[Figure: workflow for the two array types.
Spotted arrays: cDNA/EST collection on slide; control sample mRNA and test sample mRNA converted to Cy3- and Cy5-labeled cDNA; mix hybridized to one slide; Cy5/Cy3 for each gene.
Affymetrix arrays: matched and mismatched oligos (oligo 1, oligo 2, oligo 3, ... for each gene) made by synthesis on slide; control sample mRNA to one slide, test sample mRNA to another slide; biotin-labeled cDNAs hybridized to oligos; slide1/slide2 for each gene.]
Green – down or
Red – up <2-4 fold
NAP normal adjacent
MET metastatic
PCA localized
BPH benign hyperplasia
Using gene
expression
microarrays for
predicting genetic
variation in tissues.
- Michigan Prostate Study
Underexpressed: predicting lost functions
Overexpressed: predicting new metabolism
Use data to find
• An unusual gene product or gene
expression value that indicates a
good drug target
• An early change that can help
with early detection/diagnosis
Microarrays provide new drug targets - 1
Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell? Cancer cells need these additional proteins to support their abnormal metabolism.
[Figure: normal cell and cancer cell with gene product levels A and AAA; inhibitor of A product.]
Microarrays provide new drug targets - 2
Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb).
[Figure: normal cell (A+B+) and cancer cell (A-B+); inhibitor of B product.]
Careful Experimental Design and
Statistical Analysis are Extremely
Important
1. Plan experiment so as to identify sources of
variation
2. Include biological replication
3. Perform data quality analysis
4. Find genes that are varying significantly
using data model in 1.
5. Mine this gene list for biological information
Complications: genetic variability person to
person, cancer stage, tissues are cell
mixtures
Analysis of biological data with a variable
genetic component is not new!
We are using R statistical computing/BioConductor for data
analysis combined with Perl/Bioperl for biological mining.
R has tools for looking at data quality, etc.
Background varies slide to slide
[Figure panels: bad spotted array; good Affy array]
Antibody used for immunochemical stain reveals
which cells are producing a protein (cytokeratin)
Labeled cells
Unlabeled cells
Example of Pancreatic Cancer
• 1/200 people get pancreatic cancer; 1/4 if have
pancreatitis
• It is a very painful and debilitating disease
• Death usually within 1-2 years of discovery
• Few drugs available - gemcitabine hopeful but only
helps small percentage of people
• There is very little currently being spent on research
into pancreatic cancer compared to other cancers
• I will describe early results: 4 cancer tissues vs one
normal tissue on Agilent spotted arrays (24K genes).
Boxplots of normalized data of 4 tissues reveal
that between-slide comparisons should be valid.
Boxplots illustrate that distribution of M values in each
sample is similar. Bars are 25% and 75% levels.
Normalization within arrays corrects for
labeling and label detection variation.
[Figure panels: MA plot with no normalization; MA plot with Loess normalization]
Red - tumor
Blue - normal
A = average of log2 R
and log2 G values
(log2 of the square
root of their product)
M = log of R to
G ratio to the
base 2.
MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes.
The crowd of spots in the lower part of the graph, two of which are labeled R25, are the +ve controls with
a deliberately reduced R/G ratio; two -ve controls, which should not change, are on the center left near 0.
Two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in
these tissues, are also shown. Normalization restores M of most genes to approx. 0.
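The M and A coordinates of the plot can be sketched in a few lines. This is a minimal illustration, not the analysis pipeline used for the slides: the function name `ma_values` and the intensities are hypothetical, and A is assumed to be on the log2 scale, which is the usual MA-plot convention and matches the A values of roughly 7 to 14 in the results table.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    M = log2(R/G) is the log ratio; A is the mean of log2 R and log2 G,
    i.e. log2 of the square root of R*G (assumed convention).
    """
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

# Hypothetical intensities: one up-regulated, one unchanged, one down-regulated spot.
spots = {
    "up": (1600.0, 400.0),
    "unchanged": (1000.0, 1000.0),
    "down": (250.0, 1000.0),
}

for name, (r, g) in spots.items():
    m, a = ma_values(r, g)
    print(f"{name}: M={m:+.2f} A={a:.2f}")
```

A spot with equal red and green signal sits at M = 0, which is where normalization is expected to place most genes.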
Top 100 genes that are statistically best
supported are mostly down-regulated.
Red - tumor
Blue - normal
A1 = average of log2 R
and log2 G values (log2 of
the square root of their product)
M1 = log of R to G
ratio to the base 2.
Volcano plot of fold change (x axis) against log odds that gene is
differentially expressed (y axis) for 100 most significantly varying
genes.
This plot also shows that the
most significantly varying genes
in the pancreatic cancer tissues
are down-regulated, which
probably means they are not
functional. Some down-regulated
genes are also tumor
suppressor genes and thus are
candidates for project 2 drug
screens in the Pancreatic PPG.
Log odds of 5 means that
the odds that these genes
are NOT varying significantly
from M=0 are e^-5 = 1/148. This
is a measure of the false
discovery rate.
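The log-odds arithmetic can be checked with a short sketch. The function names are hypothetical, and the conversion assumes the log odds are natural logarithms, as the use of e on the slide implies.

```python
import math

def odds_from_log_odds(b):
    """Odds in favor of differential expression for log odds b (natural log)."""
    return math.exp(b)

def prob_from_log_odds(b):
    """Probability of differential expression implied by log odds b."""
    odds = math.exp(b)
    return odds / (1.0 + odds)

for b in (0.0, 1.5, 5.0):
    odds = odds_from_log_odds(b)
    p_not = 1.0 - prob_from_log_odds(b)
    print(f"B={b}: odds={odds:.2f} to 1, P(not differentially expressed)={p_not:.4f}")
```

Odds of about 148 to 1 correspond to a probability of roughly 0.007 that the gene is not differentially expressed; odds and probability are close but not identical at this scale.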
Example of genes varying significantly between 4 pancr.
cancer tissues and a normal pancr. tissue sample. -TGen
data - Agilent arrays.
Gb_accession | GeneName | Description | M = log2(R/G) | A = log2 sqrt(RG) | t | p corr. for FDR | B
BC004490 | FOS | V-fos transcr. factor | 3.6 | 10.3 | 36.9 | 0.00090 | 7.41
NM_033194 | NM_033194.1 | Heat shock pr B9 | -1.7 | 8.2 | -30.0 | 0.00090 | 6.64
Y12661 | VGF | VGF nerve growth factor | -2.5 | 13.7 | -28.3 | 0.00090 | 6.41
AF488739 | GABABL | family G protein coupled rec. | -2.0 | 10.2 | -26.1 | 0.00090 | 6.06
……
NM_015711 | GLTSCR1 | Glioma tumor suppressor | -1.0 | 10.3 | -15.3 | 0.00188 | 3.51
……
BC000311 | COPEB | Kruppel-like transcr. factor | 1.6 | 10.8 | 13.8 | 0.00250 | 2.99
NM_006999 | POLS | DNA Poly. sigma | 0.8 | 9.0 | 13.8 | 0.00250 | 2.98
……
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.4 | 7.4 | 7.6 | 0.00865 | -0.19
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.6 | 0.00867 | -0.20
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.5 | 7.3 | 7.6 | 0.00872 | -0.21
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.5 | 0.00888 | -0.26
p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change. B = log-odds that gene is differentially expressed, e.g. if B = 1.5, odds are e^1.5 = 4.48, i.e., odds of correct prediction are 4.48 to 1. For B = 0, odds = 1 to 1.
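The Benjamini and Hochberg adjustment named in the footnote can be sketched as follows. This is a minimal stand-alone illustration, not the BioConductor code behind the table: the function name `bh_adjust` and the raw p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Each p-value is scaled by n / rank (rank 1 = smallest), then a running
    minimum taken from the largest rank downward keeps the result monotone.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = bh_adjust(raw)
for p, q in zip(raw, adj):
    print(f"raw p={p:.3f}  adjusted p={q:.5f}")

# Genes reported at an acceptable FDR of 5%.
print(sum(q <= 0.05 for q in adj), "genes pass at FDR 0.05")
```

Raising or lowering the acceptable FDR changes how many genes pass, which is why the size of a gene list depends on the threshold chosen.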
What do you do with a list of
genes?
• Influence on known metabolic and regulatory
pathways (usually ~1/4 of genes)
• Gene Ontology (GO) terms
• Protein-protein and gene-gene interactions
• Where located - genome amplification,
rearrangements?
• Agreement with models - biological and
computational
Local genome databases are
maintained at AZCC
• Local databases of human, rat, mouse, and model organisms
• Direct links to genetic, proteomic, and regulatory/pathway databases
• Information on protein-protein and gene-gene interactions
• http://www.biorag.org is public access Web site
Pathway Miner
• http://www.biorag.org/pathway.html
• Pandey et al. 2004 Bioinformatics. 20:2156-8
• Builds genetic network displays based on
regulatory and metabolic relationships
• Produces lists of genes in Excel format
Genetic network analysis of pancreatic data with Pathway Miner
- top 800 pancreatic genes - GenMAPP pathways
A java interactive display
that can be filtered in many
ways. Click on gene
names to retrieve all
relevant information and on
edges to view the pathways
in common. Any list of
genes can be uploaded for
analysis.
Five genes in the top 800 are in MAPK, including FOS
About prostate cancer
• Men screened for a serum antigen - PSA
• If levels go up -> biopsy specimens examined for
evidence of cancer (black box -> Gleason score)
• Decision made about prostatectomy (undesirable -
incontinence, sexual dysfunction, etc.)
• Survival about 2/3
• Tissues collected from men used for gene expression
analysis using Affymetrix arrays (about 12,500
genes)
Analysis of a large Affy prostate data
set (Singh et al. 2002, Cancer Cell)
• 50 normal tissues
• 52 staged tissues
• Perform BioConductor Linear Models (LIMMA)
analysis
• Trying advanced statistical modeling and
clustering of genes e.g. independent
component analysis (fastICA, MLICA), mixture
models (nlme)
• Test models of penetration, altered
metabolism, etc.
The data set: human Affy
hgu95av2 chip - 1/3 of
genome
• 50 normal prostate tissues
• 52 cancer tissues at different stages
– 29 negative capsule penetration/20 positive
– 13 positive resection surg. margin/37 negative
– 9 non re-occurring/5 re-occurring
– Gleason score available
– no apparent dissection
– no apparent pairing of N/C samples
Singh et al. Cancer Cell March 2002
Results from prostate data set
• Can find about 600-1,000 genes
changing N/C depending on acceptable
level of FDR
• No significant changes for capsular, margin,
or recurrence data (agrees with paper)
• What next?
Biocarta pathways
Metabolic changes in prostate cancer - cells
deprived of oxygen depend on these changes.
Cancer cells in general learn to survive with reduced oxygen and they make a factor
(vascular endothelial growth factor or VEGF) that induces growth of blood vessels. This is
clearly observed in gene expression data.
Getting at the unknown gene
relationships.
• Try to identify sets of genes that are regulated
independently of other sets
• A new method is independent component analysis
(vs. principal components analysis, etc)
– Can superimpose regulatory models and build a more
detailed model
• Genes interact to different degrees
• Problem: find sets of genes that are statistically most
different across the tissue samples
• R provides resources
Independent component analysis: suitable
for building and testing regulatory models
[Figure: Matrix X (genes x samples) = Matrix A (genes x components) x Matrix S (components x samples). Sample columns are labeled N N … C C; example rows: 33331111, 31331311, 31311311.]
Do any of these gene groups better
separate the sample classes C and N?
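The matrix bookkeeping in the diagram, and the question of whether a component separates the N and C classes, can be sketched with toy numbers. This sketch does not estimate independent components (packages such as fastICA do that); it only shows how a genes x samples matrix is written as a product of two factors and how one row of S could be scored against the class labels. All values, the helper names `matmul` and `class_gap`, and the scoring rule (difference of class means) are assumptions made for illustration.

```python
def matmul(a, s):
    """Multiply a (genes x components) by s (components x samples)."""
    n_comp = len(s)
    n_samp = len(s[0])
    return [
        [sum(row[k] * s[k][j] for k in range(n_comp)) for j in range(n_samp)]
        for row in a
    ]

def class_gap(component, labels):
    """Absolute difference between a component's mean over N and over C samples."""
    n_vals = [v for v, lab in zip(component, labels) if lab == "N"]
    c_vals = [v for v, lab in zip(component, labels) if lab == "C"]
    return abs(sum(n_vals) / len(n_vals) - sum(c_vals) / len(c_vals))

labels = ["N", "N", "N", "N", "C", "C", "C", "C"]

# Hypothetical component activities (components x samples).
S = [
    [3, 3, 3, 3, 1, 1, 1, 1],  # follows the N/C split
    [3, 1, 3, 1, 3, 1, 3, 1],  # unrelated to the N/C split
]

# Hypothetical gene loadings (genes x components).
A = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

X = matmul(A, S)  # genes x samples
gaps = [class_gap(row, labels) for row in S]
best = max(range(len(S)), key=lambda k: gaps[k])
print("X has", len(X), "genes and", len(X[0]), "samples")
print("class gaps per component:", gaps, "best component:", best)
```

In a real analysis a component with a large class gap would point to the group of genes, those with large loadings in the matching column of A, worth testing as a regulatory model.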
Use of ICA in analysis of
endometrial cancer
Noise
Good separation
SA Saidi et al. Oncogene 2004
Some samples of ICA: objective - can we find a
set to discriminate gene and tissue classes in
prostate ca.?
Hierarchical clustering (complete) of 102 prostate tissue samples
Boxplots of 102 samples after ICA
Another approach - use list of genes that
are of biological interest during early
stages of prostate Ca. and build model.
• 14-3-3 sigma
• actinin
• BP180
• BP230
• cadherin
• catenin
• CD151
• CD44
• CD63
• CD81
• CD9
• connexin 32
• desmocollin
• desmoglein
• desmoplakin
• ehm2
• EWI
• Ezrin
• fascin
• fibulin
• HD1
• keratin
• laminin
• MTA3
• Nanos homolog 1
• PKC-delta
• plakoglobin
• Plectin
• SNAI1
• tenascin
• vinculin
• Zona Occludens 1
• Zona Occludens 2
Genes related to cell adhesion to intracellular matrix
If these genes change, then expect cells to be able to penetrate the capsule and invade
surrounding tissues.
The source of germline
variability in humans
One of my pairs of chromosomes
Maternal
Paternal
What I pass on to our children
Hundreds of thousands of differences in sequence called SNPs - single nucleotide polymorphisms
What my wife passes on to our children
Inheritance is through haplotype blocks of 10s to 100s of kbases
Genotype revealed in humans by haplotype structure of 5q31 (Daly et al. 2001)
Goal: relationship between genotype and expression.
Pomp et al. 2004
Large scale
expression analysis
mapped against
genotype.
Conclusions and future plans
• gene expression data are used to identify drug targets
• further analysis
– ICA analysis - Maximum likelihood method
– Examine all penetration related genes for possible variation
Acknowledgements
Colleagues at
UMC/AZCC/SWEHSC
• Ritu Pandey, Greg
Thomas, Rob Klein,
Raghavendra Guru
• Dave Alberts
• Anne Cress
• Gene Gerner
• Serrine Lau - SWEHSC
• Clark Lantz
• Ray Nagle
• Garth Powis
• George Tsaprailis and the
proteomics core
• Bernie Futscher and
George Watts of the
genomics core
Colleagues at
TGen, Phoenix
• Dan Von Hoff
• Jeff Trent
• Phillip Stafford
• Haiyong Han
• Spyro Mousses