BCB 444/544 F07 ISU Dobbs #38 - Proteomics 11/28/07
BCB 444/544
Lecture 38
Review: Microarrays
Proteomics
#38_Nov28
Thanks to Doina Caragea, KSU
Required Reading (before lecture)
√ Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik) http://www.r-project.org/
Fri Nov 30 - Lecture 39 (Last Lecture!): Systems Biology (& a bit of Metabolomics & Synthetic Biology)
Assignments & Announcements
Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26 or ASAP)
Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’) #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’) #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10, Howe Hall Auditorium
• Greg Voth, Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson, Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia, ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams, Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death
Chp 18 – Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays
Gene Expression Analysis
Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest
Microarray Analysis - Questions & Answers
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
  • Single link
  • Complete link
  • Average link
• What are “good clusters”?
  • Big difference between INTRA-cluster distance and INTER-cluster distance, i.e., INTRA-cluster distance is minimized while INTER-cluster distance is maximized
• What are pros & cons of:
  • Hierarchical vs K-means clustering
  • Clustering vs Classification
Clustering Metrics
• A key issue in clustering is to determine what similarity / distance metric to use
• Often, such a metric has a bigger effect on the results than the actual clustering algorithm used!
• When determining the metric, we should take into account our assumptions about the data and the goal of the clustering
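To illustrate why the choice of metric matters, here is a minimal sketch comparing two common choices, Euclidean distance and a Pearson-correlation distance (1 - r). The slide does not name specific metrics, so both functions and the three expression profiles (`gene_a`, `gene_b`, `gene_c`) are assumptions made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance: sensitive to the magnitude of expression."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: sensitive to the shape of the profile only."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles over four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude, opposite trend

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "euclidean:", round(euclidean(gene_a, other), 2),
          "pearson distance:", round(pearson_distance(gene_a, other), 2))
```

Under Euclidean distance gene_a looks closer to gene_c, while under the correlation-based distance it looks closer to gene_b, so the same clustering algorithm could group these genes differently depending on the metric.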
How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg of the distance between
  - All pairs of points in the cluster, OR
  - The centroid and all points in the cluster
Inter-cluster distance
• Single link
  • Distance between two most similar members
• Complete link
  • Distance between two least similar members
• Average link
  • Average distance of all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)
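The four inter-cluster distances can be sketched as follows. This is only an illustration, assuming Euclidean distance between points; the function names and the two small clusters (`cluster_x`, `cluster_y`) are hypothetical and not from the lecture.

```python
import math
from itertools import product

def dist(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Arithmetic mean of each coordinate of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def single_link(c1, c2):
    """Distance between the two closest (most similar) members."""
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """Distance between the two farthest (least similar) members."""
    return max(dist(p, q) for p, q in product(c1, c2))

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return dist(centroid(c1), centroid(c2))

# Hypothetical clusters on a line
cluster_x = [(0.0, 0.0), (1.0, 0.0)]
cluster_y = [(3.0, 0.0), (5.0, 0.0)]

print(single_link(cluster_x, cluster_y))        # 2.0
print(complete_link(cluster_x, cluster_y))      # 5.0
print(average_link(cluster_x, cluster_y))       # 3.5
print(centroid_distance(cluster_x, cluster_y))  # 3.5
```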
INTRA- vs INTER-Cluster Distances
Good! Bad!
Methods for Clustering (Unsupervised Learning)
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
  • (in lab, won’t discuss in lecture)
• …many others….
Hierarchical Clustering*
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4
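The agglomerative steps above can be sketched in a few lines. This is a simplified illustration, not the Eisen et al. implementation: it assumes 1-dimensional values with numerical difference as the distance, recomputes cluster distances on every pass instead of maintaining a distance matrix, and uses hypothetical names (`agglomerative`, `linkage`, `stop_at`). Passing `min` gives single link and `max` gives complete link.

```python
def agglomerative(points, linkage=min, stop_at=1):
    """Bottom-up clustering of 1-D values.

    Returns the remaining clusters and the merge history as
    (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]          # step 0: each item is a cluster
    merges = []
    while len(clusters) > stop_at:            # step 5: repeat until merged
        best = None
        # Steps 1-2: find the two closest clusters under the chosen linkage
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge them; step 4 happens implicitly on the next pass
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Illustrative values; stopping at 2 clusters acts like cutting the tree
clusters, merges = agglomerative([1, 2, 5, 6, 7], stop_at=2)
print(sorted(sorted(c) for c in clusters))  # [[1, 2], [5, 6, 7]]
```

The merge history plays the role of the tree: each entry records which two nodes were joined and at what distance, so choosing `stop_at` (or a distance cutoff on the history) corresponds to choosing a cut level.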
Copyright: Russ Altman
Hierarchical Clustering: Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is most commonly used method
• Can also perform top-down, which requires splitting a large group successively
K-Means Clustering (Model-based)
Computationally attractive!
1. Choose random points (cluster centers or centroids) in k dimensions
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much
[Figure: example for K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate), showing Initial Centroids A & B and 2nd Centroids A & B]
K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: (C1) = 8/3 = 2.7, (C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: (C1) = 1.5, (C2) = 6.0
6. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
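The 1-dimensional example on this slide can be reproduced with a short sketch. The function name `kmeans_1d` and its parameters are hypothetical; the seeds are passed in explicitly (5 and 6, as on the slide) rather than chosen at random, and the loop stops when the centroids do not change at all, which is a stricter test than "do not move very much".

```python
def kmeans_1d(points, centroids, max_iter=100):
    """K-means on 1-D values, using numerical difference as the distance."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest centroid (ties go to the first)
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the average of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters)   # [[1, 2], [5, 6, 7]]
print(centroids)  # [1.5, 6.0]
```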
K-Means Clustering for k=2: A more realistic example
[Figure sequence: Pick seeds, Assign clusters, Compute centroids, Re-assign clusters, Compute centroids, Re-assign clusters, Converged!]
From S. Mooney
K-Means Clustering: Strengths & Weaknesses
• Fast, O(N)
• Hard to know which K to choose
  • Try several and assess cluster quality
• Hard to know where to seed the clusters
• Results can change drastically with different initial choices for centroids - as shown in example:
Example Illustrating Sensitivity to Seeds
In the above, if start with B and E as centroids, will converge to {A,B,C} and {D,E,F}. If start with D and F, will converge to {A,B,D,E} and {C,F}.
Choice of K? Helpful to have additional information to aid evaluation of clusters
Hierarchical Clustering vs K-Means
                Hierarchical Clustering                 K-Means
Running Time    Slower                                  Faster
Assumptions     Requires distance metric                Requires distance metric
Parameters      None                                    K (number of clusters)
Clusters        Subjective (only a tree is returned)    Exactly K clusters
Clustering vs Classification
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest
Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from same cell type from every patient)
Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class
• Do: Learn a model that accurately predicts class based on features
• Outcome: Predict class value of a patient based on expression levels of his/her genes
Methods for Classification
• K-nearest neighbors (KNN)
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines
K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)
Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point
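A minimal sketch of the basic KNN steps, assuming Euclidean distance as the metric (via `math.dist`, available in Python 3.8+). The function name `knn_predict`, the labels, and the two-gene expression values in `train` are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Label a query point by majority vote among its k nearest neighbors.

    train: list of (features, label) pairs; query: tuple of features.
    """
    # Steps 1-2: compute distances to all labeled data and sort
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Step 3: majority vote among the k closest
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: expression of two genes per patient
train = [
    ((1.0, 1.0), "healthy"), ((1.2, 0.8), "healthy"), ((0.9, 1.1), "healthy"),
    ((3.0, 3.2), "disease"), ((3.1, 2.9), "disease"), ((2.8, 3.0), "disease"),
]

print(knn_predict(train, (1.1, 1.0), k=3))  # healthy
print(knn_predict(train, (2.9, 3.1), k=3))  # disease
```

With an even k or more than two classes, votes can tie; here a tie is resolved in favor of the label seen first among the nearest neighbors, which is one of several reasonable choices.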
Variations on KNN
• Can classify into multiple classes easily
• Weighted KNN - can weight votes of nearby training samples based on their distance from unknown sample
• Can set a threshold, p, for the # of votes needed to win. (If no winner, then either NULL result or set default winner)
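The weighted and thresholded variations can be combined in one sketch. Several choices here are assumptions, not from the slide: votes are weighted by inverse distance, the threshold `p` is interpreted as the fraction of total vote weight needed to win (the slide states it as a number of votes), and `None` stands in for the NULL result.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k, p=0.5, default=None):
    """KNN with inverse-distance vote weights and a winning threshold p."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    weights = defaultdict(float)
    for features, label in neighbors:
        # Closer neighbors get larger votes; the small constant avoids
        # division by zero when the query coincides with a training point
        weights[label] += 1.0 / (math.dist(features, query) + 1e-9)
    label, weight = max(weights.items(), key=lambda kv: kv[1])
    # The winner needs more than fraction p of the total weight
    if weight / sum(weights.values()) > p:
        return label
    return default

# Hypothetical 1-feature data: one very close "red", two farther "blue"
samples = [((0.0,), "red"), ((1.0,), "blue"), ((1.1,), "blue")]

print(weighted_knn(samples, (0.1,), k=3))         # red
print(weighted_knn(samples, (0.1,), k=3, p=0.9))  # None
```

Plain majority voting over the same three neighbors would pick "blue", so this small case shows how distance weighting can change the outcome.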
Compare in Graphical Representation
Apply external labels: RED group & BLUE group
[Figure panels: Classification, Clustering]
Tradeoffs for Clustering vs Classification
• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters
• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge
Chp 19 – Proteomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 19 Proteomics
• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions
ISU Proteomics Resources & Researchers
Facilities:
Proteomics Facility (Carver Co-lab) http://www.plantgenomics.iastate.edu/proteomics/
Protein Facility (MBB) http://www.protein.iastate.edu/
Experiments:
Plant: Rodermel, Wise, Voytas
Animal: Greenlee, perhaps others soon?
Computational Analysis:
Honavar, Wise, Dobbs
Proteomics: What do all those proteins do??
Copyright © 2006 A. Malcolm Campbell
Biological processes for yeast proteins
Proteome Analysis: “Traditionally”
using Two-dimensional (2D) gels
Copyright © 2006 A. Malcolm Campbell
1st D: Isoelectric focusing (IEF) in pH gradient: Proteins migrate to isoelectric points & stop moving
2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis):
Proteins migrate according to molecular weight
Proteins identified on 2D gels (IEF/SDS-PAGE)
Direct protein microsequencing by Edman degradation
-- done at facilities (here at ISU)
-- typically need 5 picomoles
-- often get 10 to 20 amino acids of sequence
Protein mass analysis by MALDI-TOF
-- Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Spectroscopy
-- done at facilities (here at ISU)
-- often detect post-translational modifications (such as phosphorylated Ser, Thr, Tyr)
Page 250-1
Evaluation of 2D gels (IEF/SDS-PAGE)
Advantages:
Visualize hundreds to thousands of proteins
Improved identification of protein spots
Disadvantages:
Limited number of samples can be processed
Mostly abundant proteins visualized
Technically difficult
Page 251
Jonathan Pevsner
Tandem Mass Spectrometry (MS/MS) to Identify Proteins
Copyright © 2006 A. Malcolm Campbell
Figure 8.19 Tandem mass spectrometry for protein identification
a) ESI creates ionized proteins, represented by colored shapes with positive charges. Each shape represents many copies of identical proteins.
b) Ionized proteins are separated based on their mass to charge ratio (m/z) and sent one at a time into the activation chamber. Separation and selection take place in the first of the two MS devices. The solid purple protein has been selected for analysis; the other three are temporarily stored for later analysis.
c) The group of m/z selected ionized proteins enters a collision cell that is filled with inert argon gas. Gas molecules collide with proteins, which causes them to break into two peptide pieces (labeled b and y).
d) Ionized peptide pieces are sent into second MS device, which again measures the m/z ratio. A computer compares spectrum of peptide pieces to a database of ideal spectra to identify the original group of identical proteins.
MS data: Protein identification through peptide fragment identification & separation
Copyright © 2006 A. Malcolm Campbell
Figure 8.20 When a group of identical proteins is broken into peptide pieces, more than one pair of b and y peptides will be formed.
a) One protein sequence and its calculated mass on top, with the b peptides/masses (gray) and the y peptides/masses (purple) below.
b) An experimentally determined mass/charge spectrum from the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were more abundant than others. The spectrum is used to determine each peptide’s amino acid sequence and protein identity.
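To show where the calculated b and y masses in such a figure come from, here is a minimal sketch that lists the theoretical b/y fragment pairs for a peptide. It assumes singly charged, unmodified fragments and standard monoisotopic residue masses; the function name `fragment_ions` and the example sequence "PEPTIDE" are illustrative and are not the sequence from the figure.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of the charge-carrying proton
WATER = 18.010565  # H2O retained by the C-terminal (y) fragment

def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of (sequence, m/z) for charge +1.

    Entry i of each list comes from the same backbone break, so
    b_ions[i] and y_ions[i] are complementary pieces of the peptide.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append((peptide[:i], sum(masses[:i]) + PROTON))
        y_ions.append((peptide[i:], sum(masses[i:]) + WATER + PROTON))
    return b_ions, y_ions

b_ions, y_ions = fragment_ions("PEPTIDE")
for (b_seq, b_mz), (y_seq, y_mz) in zip(b_ions, y_ions):
    print(f"b: {b_seq:<7} {b_mz:9.4f}   y: {y_seq:<7} {y_mz:9.4f}")
```

A database search compares a list like this, computed for each candidate sequence, with the peaks observed in the experimental spectrum.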
Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html
Jonathan Pevsner