Visualization Approaches for Gene Expression Data
Transcript of Visualization Approaches for Gene Expression Data
March 4, 2010 1
Visualization Approaches for Gene Expression Data
Matt Hibbs
Assistant Professor
The Jackson Laboratory
March 4, 2010 2
Transcriptomics & Gene Expression
• Simultaneous measurement of transcription for the entire genome
• Useful for a broad range of biological questions
[Figure: DNA is transcribed into mRNA (Transcription); ribosomes translate mRNA into proteins (Translation)]
March 4, 2010 3
Outline
• Technologies & Specific Concerns
– cDNA microarrays (2-color & 1-color arrays)
– RNA-seq
• Normalization visualizations
• Full data displays
• Dimensionality reduction
• Sequence-order displays
• Comparative visualization
• Future Directions
March 4, 2010 4
Technology: 2-color cDNA Microarrays
Spot slide with known sequences
Add mRNA to slide for hybridization
Scan hybridized array
[Figure: reference mRNA labeled with green dye and test mRNA labeled with red dye are hybridized to spots A, B, C, D; example values after scanning: A 1.5, B 0.8, C -1.2, D 0.1]
March 4, 2010 5
Technology: 2-color cDNA Microarrays
March 4, 2010 6
Technology: RNA-seq
Image from WikiMedia
March 4, 2010 7
Normalization: MA-plot
• Need to account for intensity bias between channels (red/green, or multiple 1-color arrays)
• MA-plot (also called RI-plot) shows relationship between ratio and intensity
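As a rough illustration of the two quantities an MA-plot shows, the sketch below computes M (the log ratio) and A (the mean log intensity) for paired channel intensities, using the usual definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2. The function name `ma_values` and the example intensities are hypothetical and not taken from the slides.

```python
import math

def ma_values(red, green):
    """Return a list of (M, A) pairs for paired channel intensities.

    M = log2(R) - log2(G)        (log ratio, plotted on the y-axis)
    A = (log2(R) + log2(G)) / 2  (mean log intensity, plotted on the x-axis)
    """
    pairs = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            raise ValueError("intensities must be positive")
        log_r, log_g = math.log2(r), math.log2(g)
        pairs.append((log_r - log_g, (log_r + log_g) / 2))
    return pairs

# Hypothetical intensities for four spots (illustration only).
red = [1024.0, 512.0, 64.0, 256.0]
green = [256.0, 512.0, 256.0, 128.0]
ma = ma_values(red, green)
```

On unbiased data the M values would be expected to scatter around zero at every A; a trend in M as A changes is the intensity bias that normalization tries to remove.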
March 4, 2010 8
Normalization: Box-Whisker Quantile
• Quantile normalization often used to adjust for between-chip variance
• Box-Whisker plots typically used to visualize the process
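A minimal sketch of quantile normalization, assuming every chip measures the same probes and ignoring ties (tied values are ranked by position here, a simplification). The helper name `quantile_normalize` and the example values are hypothetical.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips (each a list of values, same length).

    Each value is replaced by the mean, taken across chips, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must have the same number of probes")
    sorted_chips = [sorted(chip) for chip in chips]
    # Mean of the k-th smallest value across all chips, for each rank k.
    rank_means = [sum(col) / len(chips) for col in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical values for three chips and four probes (illustration only).
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 3.0, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After this step every chip has the same set of values, so box-whisker plots of the chips line up exactly, which is what the before/after plots are meant to show.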
March 4, 2010 9
Full Data Displays
• Techniques to show all of the data at once
• Heat Maps
– Displays numerical values as colors
– Good to see all data intuitively
– Requires clustering to see patterns
• Parallel Coordinates
– Line plots of high-dimensional data
– Easy to see/select trends or patterns
– Esp. good for course data (time, drug, etc.)
March 4, 2010 10
Heat Maps
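[Figure: expression values shown as a heat map, with steps labeled Cluster and Rasterize; color scale runs from -3 (Under-Expressed) through 0 to +3 (Over-Expressed)]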
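The rasterize step can be sketched as a mapping from each value to a color on a scale clamped at -3 and +3. The green-black-red palette below is an assumption (it is a common microarray convention, but the slide text does not name the colors), and `expression_to_rgb` and `rasterize` are hypothetical helper names.

```python
def expression_to_rgb(value, limit=3.0):
    """Map an expression value to an (r, g, b) tuple.

    Values are clamped to [-limit, +limit]. Zero maps to black,
    -limit to pure green (under-expressed) and +limit to pure red
    (over-expressed). The palette is an assumed convention.
    """
    clamped = max(-limit, min(limit, value))
    intensity = round(255 * abs(clamped) / limit)
    if clamped >= 0:
        return (intensity, 0, 0)
    return (0, intensity, 0)

def rasterize(matrix, limit=3.0):
    """Convert a genes x conditions matrix into a grid of RGB tuples."""
    return [[expression_to_rgb(v, limit) for v in row] for row in matrix]

# Hypothetical 2 x 3 expression matrix (illustration only).
pixels = rasterize([[0.0, 3.0, -3.0],
                    [10.0, -10.0, 0.0]])
```

Clamping means that any value beyond the limit is drawn at full saturation, so the choice of limit determines how much contrast is left for moderate values.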
March 4, 2010 11
Heat Maps: Stats
• Clustering important to see patterns
– Hierarchical, K-means, SOM, etc.
– Choice of distance metric in addition to method
• Match the visualization mapping to the statistics used for analysis
– Coloration based on actual numbers appropriate for Euclidean distance measures
– Centered or normalized measures should use corresponding colorings
March 4, 2010 12
Heat Maps: Distance Metrics
Euclidean Distance
Pearson Correlation
Spearman Correlation
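To make the difference between the three measures concrete, here is a small sketch in plain Python. It assumes equal-length expression vectors with non-zero variance; the function names and the example vectors are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance; sensitive to the actual magnitudes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Linear correlation; ignores offset and scale of each vector."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_correlation(x, y):
    """Pearson correlation of the ranks; depends only on ordering."""
    return pearson_correlation(ranks(x), ranks(y))

# Hypothetical expression vectors (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, larger scale
gene_c = [1.0, 4.0, 9.0, 100.0]     # same ordering as gene_a, not linear
```

In this example gene_a and gene_b are far apart by Euclidean distance yet perfectly Pearson-correlated, while gene_a and gene_c are perfectly Spearman-correlated but only partly Pearson-correlated, so the same data can cluster differently under each measure.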
March 4, 2010 13
Heat Maps: Stats
Data clustered using a rank-based statistic
[Color scale: lowest value to highest value]
March 4, 2010 14
Heat Maps: Overview + Detail
Java TreeView, Saldanha et al.
Data from Spellman et al., 1998
March 4, 2010 15
Parallel Coordinates
• View expression vectors as lines
– X-axis = conditions
– Y-axis = value
Time Searcher, Hochheiser et al.
March 4, 2010 16
Parallel Coordinates
Time Searcher, Hochheiser et al.
• Selection and Interaction methods can answer specific questions
• Brushing techniques to select patterns
• Cluttered displays for large datasets, limited number of conditions effectively shown
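One way to think about brushing is as a range query on the lines: keep only the expression vectors that pass through a chosen value range at a chosen condition. The sketch below is a simplified illustration of that idea, not a description of how Time Searcher implements it; `brush`, `brush_all`, and the example profiles are hypothetical.

```python
def brush(profiles, condition_index, low, high):
    """Return names of genes whose value at one condition lies in [low, high]."""
    return [name for name, values in profiles.items()
            if low <= values[condition_index] <= high]

def brush_all(profiles, boxes):
    """Apply several brushes; a gene must pass every one (logical AND).

    Each box is a (condition_index, low, high) tuple.
    """
    selected = set(profiles)
    for condition_index, low, high in boxes:
        selected &= set(brush(profiles, condition_index, low, high))
    return sorted(selected)

# Hypothetical expression profiles over four conditions (illustration only).
profiles = {
    "geneA": [0.1, 1.5, 2.0, 0.3],
    "geneB": [0.0, -1.0, -1.8, -0.2],
    "geneC": [0.2, 1.2, 0.1, 0.0],
}
up_at_1 = brush(profiles, 1, 1.0, 2.0)
up_at_1_and_2 = brush_all(profiles, [(1, 1.0, 2.0), (2, 1.5, 2.5)])
```

Combining brushes narrows the selection step by step, which is one way to cut through the clutter that large datasets produce in this kind of display.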
March 4, 2010 17
Dimensionality Reduction
• Project data from large, high-dimensional space to a smaller space (usually 2 or 3 D)
• Several techniques:
– SVD & PCA
– Multidimensional scaling
• Once projected into lower dimension, use standard 2D (or 3D) techniques
March 4, 2010 18
Dimensionality Reduction
March 4, 2010 19
Dimensionality Reduction: SVD
Transform original data vectors into an orthogonal basis that captures decreasing amounts of variation
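A short sketch of this transformation using NumPy's `numpy.linalg.svd`. Centering each condition before the decomposition is an assumption made here (it makes the result equivalent to PCA); the function name `svd_project` and the random example matrix are hypothetical.

```python
import numpy as np

def svd_project(data, n_components=2):
    """Project rows of `data` (genes x conditions) onto the top singular vectors.

    Returns the gene coordinates, the basis vectors, and the fraction of
    variation captured by every component.
    """
    # Assumption: center each condition (column) before decomposing.
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each gene in the new orthogonal basis.
    coords = u[:, :n_components] * s[:n_components]
    # Singular values come back in descending order, so these fractions decrease.
    explained = s ** 2 / np.sum(s ** 2)
    return coords, vt[:n_components], explained

# Random stand-in for a 50-gene x 6-condition matrix (illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
coords, basis, explained = svd_project(data)
```

The two columns of `coords` can then be drawn with an ordinary 2D scatter plot, and `explained` indicates how much of the variation those two axes retain.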
March 4, 2010 20
Dimensionality Reduction: SVD
SVD
March 4, 2010 21
SVD Example
[Figure: SVD example; legend: G1, S, G2, M, M/G1]
GeneVAnD, Hibbs et al.
Data from Spellman et al., 1998
March 4, 2010 22
Sequence-based Visualization
• View data in chromosomal order
– Copy number variation & aneuploidies
• common in cancers & other disorders
– Comparative Genomic Hybridization (CGH)
– mRNA sequencing (RNA-seq)
– Borrows concepts from genome browsers
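As a rough sketch of what viewing data in chromosomal order involves, the code below sorts measurements by chromosome and position and then smooths them with a running mean, so that a run of consistently shifted values (as might accompany a copy number change) stands out. The record layout, the function names, and the example values are all hypothetical.

```python
def chromosomal_order(records):
    """Sort (gene, chromosome, position, value) records by chromosome, then position."""
    return sorted(records, key=lambda rec: (rec[1], rec[2]))

def running_mean(values, window=3):
    """Centered running mean; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical records: (gene, chromosome number, position, value).
records = [
    ("g3", 2, 500, 1.0),
    ("g1", 1, 100, 0.1),
    ("g2", 1, 900, -0.2),
    ("g4", 2, 50, 0.9),
]
ordered = chromosomal_order(records)
ordered_genes = [rec[0] for rec in ordered]
smoothed = running_mean([0.0, 0.0, 3.0, 3.0, 3.0])
```

In practice the smoothing would probably be done per chromosome so that values are not averaged across chromosome boundaries; this sketch leaves that out for brevity.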
March 4, 2010 23
Sequence-based: CGH
• Karyoscope plots
Java TreeView, Saldanha et al.
March 4, 2010 24
Sequence-based: RNA-seq
IGV, http://www.broadinstitute.org/igv
March 4, 2010 25
Comparative Visualization
• Using multiple simultaneous complementary views of data
• Each scheme emphasizes different aspects – use multiple to show overall picture
• Show multiple, related datasets to identify common and unique patterns
March 4, 2010 26
Comparative Visualization: Single Dataset
MeV, Saeed et al.
March 4, 2010 27
Comparative Visualization: Single Dataset
Spotfire
GeneSpring
March 4, 2010 28
Comparative Visualization: Multi-dataset
Dendrogram
Heat Map Overview
HIDRA
Data from Spellman et al., 1998
Hibbs et al.
March 4, 2010 29
Comparative Visualization: Multi-dataset
HIDRA
Selection
Synchronized Details
Data from Spellman et al., 1998
Hibbs et al.
March 4, 2010 30
Comparative Visualization: Multi-dataset
HIDRA
Selection
Data from Spellman et al., 1998
Hibbs et al.
March 4, 2010 31
Summary & Tools
• R & Bioconductor
• Java TreeView (Saldanha, 2004)
• Time Searcher (Hochheiser et al., 2003)
• Integrative Genomics Viewer (IGV; www.broadinstitute.org/igv)
• TIGR’s MultiExperiment Viewer (MeV; Saeed et al., 2003)
• HIDRA (Hibbs et al., 2007)
March 4, 2010 32
Trends & Future Directions
• Emphasis on usability and audience
– If a “wet bench” biologist can’t use it…
• Incorporate common statistical analysis techniques with visualizations
– e.g. differential expression tests, GO enrichments, etc.
• Isoforms and splice variants
• New user interaction schemes
– e.g. multi-touch interfaces, large-format displays
• Low level “systems analysis”
– linking together multiple types of data into unified displays
March 4, 2010 33
Acknowledgements
• Hibbs Lab
– Karen Dowell
– Tongjun Gu
– Al Simons
• Olga Troyanskaya Lab
– Patrick Bradley
– Maria Chikina
– Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
– Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak, Broach, Rose labs
• Kyuson Yun
• Carol Bult
March 4, 2010 34
The Center for Genome Dynamics at The Jackson Laboratory
www.genomedynamics.org
Investigators use computation, mathematical modeling and statistics, with a shared focus on the genetics of complex traits
Requires PhD (or equivalent) in quantitative field such as computer science, statistics, applied mathematics or in biological sciences with strong quantitative background
Programming experience recommended
The Jackson Laboratory was voted #2 in a poll of postdocs conducted by The Scientist in 2009 and is an EOE/AA employer
Postdoctoral Opportunities in Computational & Systems Biology