
BCB 444/544 F07 ISU Dobbs #38 - Proteomics 11/28/07

BCB 444/544

Lecture 38

Review: Microarrays

Proteomics

#38_Nov28

Thanks to Doina Caragea, KSU


Required Reading (before lecture)

√ Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms
• Chp 18 Functional Genomics

Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions
• Chp 19 Proteomics

Thurs Nov 29 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik) http://www.r-project.org/

Fri Nov 30 - Lecture 39 (Last Lecture!): Systems Biology (& a bit of Metabolomics & Synthetic Biology)


Assignments & Announcements

Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26 or ASAP)

Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)

ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!

Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’) #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’) #4: Addie (10-15’)

Thurs Dec 6 - Optional Review Session for Final Exam

Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)

Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)


Seminars this Week

BCB List of URLs for Seminars related to Bioinformatics:

http://www.bcb.iastate.edu/seminars/index.html

Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium
• Greg Voth, Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach

Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson, Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?

Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia, ComS, ISU
• Harnessing the Potential of XML

Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams, Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death


Chp 18 – Functional Genomics

SECTION V GENOMICS & PROTEOMICS

Xiong: Chp 18 Functional Genomics

• Sequence-based Approaches

• Microarray-based Approaches

• Comparison of SAGE & DNA Microarrays


Gene Expression Analysis


Pattern Recognition in Microarray Analysis

• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources

• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest


Microarray Analysis - Questions & Answers

• How do hierarchical clustering algorithms work?

• How do we measure the distance between two clusters? (similarity criteria)

  - Single link
  - Complete link
  - Average link

• What are “good clusters”?
  - Big difference between INTRA-cluster distance and INTER-cluster distance, i.e., INTRA-cluster distance is minimized while INTER-cluster distance is maximized

• What are pros & cons of:
  - Hierarchical vs K-means clustering
  - Clustering vs Classification


Clustering Metrics

• A key issue in clustering is to determine which similarity / distance metric to use
• Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
• When choosing the metric, we should take into account our assumptions about the data and the goal of the clustering


How to Determine Distances?

Intra-cluster distance
• Min/Max/Avg of the distance between
  - All pairs of points in the cluster, OR
  - The centroid and all points in the cluster

Inter-cluster distance
• Single link
  - distance between the two most similar members
• Complete link
  - distance between the two least similar members
• Average link
  - average distance over all pairs
• Centroid distance

What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)
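
The inter-cluster criteria above can be sketched in a few lines of Python. This is only an illustration, assuming Euclidean distance and clusters given as lists of coordinate tuples; the function names and the two toy clusters are made up for the example and do not come from the lecture.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    """Arithmetic mean of each coordinate of the points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def single_link(a, b):
    """Distance between the two closest (most similar) members."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Distance between the two farthest (least similar) members."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))


# Hypothetical toy clusters, chosen only to make the numbers easy to check.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for measure in (single_link, complete_link, average_link, centroid_distance):
    print(measure.__name__, round(measure(cluster_a, cluster_b), 3))
```

Single link never exceeds average link, which never exceeds complete link, so the choice of criterion changes which pair of clusters looks "closest".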


INTRA- vs INTER-Cluster Distances

Good! Bad!


Methods for Clustering (Unsupervised Learning)

• Hierarchical Clustering

• K-Means

• Self Organizing Maps (in lab, won’t discuss in lecture)

• …many others…


Hierarchical Clustering*

• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes

Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node

*This method was illustrated in Lecture 36, Tables 6.1-MM6.4

Copyright: Russ Altman
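
The agglomerative steps listed on this slide can be sketched as follows. This is a minimal illustration, not the Eisen et al. implementation: it assumes 1-dimensional values with numerical difference as the distance, and for brevity it recomputes cluster distances from the members on every pass instead of maintaining a distance matrix. The function name and the `linkage` parameter are hypothetical.

```python
def agglomerative(values, linkage=min):
    """Bottom-up clustering of 1-D values.

    linkage=min gives single link, linkage=max gives complete link.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [[v] for v in values]  # step 0: each item is its own cluster
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: find the two closest clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge them and record the merge.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Steps 4-5: replace the pair with the merged node and repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


for left, right, d in agglomerative([1, 2, 5, 6, 7]):
    print(left, "+", right, "at distance", d)
```

The list of merges is the tree; cutting it at a chosen distance (for example, ignoring the last merge) yields the clusters.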


Hierarchical Clustering: Strengths & Weaknesses

• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima

Bottom-up is most commonly used method
• Can also perform top-down, which requires splitting a large group successively


K-Means Clustering (Model-based)

Computationally attractive!

1. Choose K random points (cluster centers, or centroids) in the feature space

2. Compute distance from each data point to centroids

3. Assign each data point to closest centroid

4. Compute new cluster centroid as average of points assigned to cluster

5. Loop to (2); stop when cluster centroids no longer move very much

[Figure: K-means for K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate), showing the initial and 2nd positions of Centroid A and Centroid B]


K-Means Clustering Example, for k=2

Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids:
(C1) = 8/3 ≈ 2.7
(C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids:
(C1) = 1.5
(C2) = 6.0
6. No change? Converged!

=> Final clusters = {1,2} & {5,6,7}

[Figure: number lines showing objects 1, 2, 5, 6, 7 with the centroids at each step (2.7 & 6.5, then 1.5 & 6.0)]

For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
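
The worked example above can be reproduced with a short Python sketch. It assumes, as the slide does, 1-dimensional objects with numerical difference as the distance; the function name `kmeans_1d` and its `max_iter` safeguard are illustrative choices, not part of the lecture.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """K-means on 1-D points, starting from the given initial centroids."""
    clusters = []
    for _ in range(max_iter):
        # Assign each point to the closest centroid (ties go to the first one).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its points;
        # an empty cluster keeps its previous centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return clusters, centroids


# Objects and initial centers taken from the example on this slide.
clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters)   # [[1, 2], [5, 6, 7]]
print(centroids)  # [1.5, 6.0]
```

Calling the function with different initial centroids is an easy way to see how sensitive the result is to the choice of seeds.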


K-Means Clustering for k=2: A more realistic example

[Figure: sequence of panels: Pick seeds → Assign clusters → Compute centroids → Re-assign clusters → Compute centroids → Re-assign clusters → Converged!]

From S. Mooney


K-Means Clustering: Strengths & Weaknesses

• Fast, O(N)
• Hard to know which K to choose
  - Try several and assess cluster quality
• Hard to know where to seed the clusters
• Results can change drastically with different initial choices for centroids - as shown in example:

Example Illustrating Sensitivity to Seeds
[Figure: six points labeled A-F]
If we start with B and E as centroids, K-means will converge to {A,B,C} and {D,E,F}.
If we start with D and F, it will converge to {A,B,D,E} and {C,F}.


Choice of K? Helpful to have additional information to aid evaluation of clusters


Hierarchical Clustering vs K-Means

                Hierarchical Clustering                 K-Means
Running Time    Slower                                  Faster
Assumptions     Requires distance metric                Requires distance metric
Parameters      None                                    K (number of clusters)
Clusters        Subjective (only a tree is returned)    Exactly K clusters


Clustering vs Classification

• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources

• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest


Classification: Supervised Learning Task

• Given: a set of microarray experiments, each done with mRNA from a different patient (but from same cell type from every patient)

Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class

• Do: Learn a model that accurately predicts class based on features

• Outcome: Predict class value of a patient based on expression levels of his/her genes


Methods for Classification

• K-nearest neighbors (KNN)

• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines


K-Nearest Neighbor (KNN)

• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)


Basic KNN Algorithm

INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric

1. For each unlabeled data point, compute distance to all labeled data

2. Sort distances, determine closest K neighbors (smallest distances)

3. Use majority voting to predict label of unlabeled data point
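
The three steps above map directly onto a few lines of Python. This is a minimal sketch assuming Euclidean distance as the metric and training data stored as (feature vector, label) pairs; the function name and the toy RED/BLUE points are hypothetical. On a tied vote, this sketch returns the tied label whose nearest member comes first in the sorted list.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(training, query, k):
    """Label `query` by majority vote among its k nearest training points.

    training: list of (feature_vector, label) pairs.
    """
    # 1. Compute the distance from the query to every labeled point.
    distances = [(dist(features, query), label) for features, label in training]
    # 2. Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two features per sample, two classes.
training = [
    ((1.0, 1.0), "RED"),
    ((1.5, 2.0), "RED"),
    ((2.0, 1.0), "RED"),
    ((6.0, 6.0), "BLUE"),
    ((7.0, 6.5), "BLUE"),
]

print(knn_predict(training, (2.0, 2.0), k=3))  # RED
print(knn_predict(training, (6.5, 6.0), k=3))  # BLUE
```

In a microarray setting each feature vector would hold one patient's expression values and the label would be the patient's disease class.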


Variations on KNN

• Can classify into multiple classes easily
• Weighted KNN - can weight votes of nearby training samples based on their distance from unknown sample
• Can set a threshold, p, for the # of votes needed to win. (If no winner, then either NULL result or set default winner)
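
A sketch of the weighted and thresholded variants follows. Two choices here are assumptions made for the example rather than anything stated in the lecture: votes are weighted by inverse distance, and the threshold p is interpreted as the fraction of total vote weight the winner must exceed. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def weighted_knn(training, query, k, p=0.5, default=None):
    """Distance-weighted KNN with a winning threshold.

    Returns `default` (the NULL result) when no label's share of the
    total vote weight exceeds p.
    """
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # Closer neighbors get larger votes; the small constant avoids
        # division by zero when the query coincides with a training point.
        votes[label] += 1.0 / (dist(features, query) + 1e-9)
    total = sum(votes.values())
    winner = max(votes, key=votes.get)
    return winner if votes[winner] / total > p else default


# Hypothetical data: one nearby RED sample, two more distant BLUE samples.
training = [((0.0, 0.0), "RED"), ((4.0, 0.0), "BLUE"), ((5.0, 0.0), "BLUE")]

print(weighted_knn(training, (1.0, 0.0), k=3))          # RED
print(weighted_knn(training, (1.0, 0.0), k=3, p=0.9))   # None
```

With plain majority voting the same query would be labeled BLUE (two votes to one); weighting by distance lets the single close neighbor win.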


Compare in Graphical Representation

Apply external labels: RED group & BLUE group

[Figure panels: Classification, Clustering]


Tradeoffs for Clustering vs Classification

• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters

• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge


Chp 19 – Proteomics

SECTION V GENOMICS & PROTEOMICS

Xiong: Chp 19 Proteomics

• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions


ISU Proteomics Resources & Researchers

Facilities:
Proteomics Facility (Carver Co-lab) http://www.plantgenomics.iastate.edu/proteomics/
Protein Facility (MBB) http://www.protein.iastate.edu/

Experiments:
Plant: Rodermel, Wise, Voytas
Animal: Greenlee, perhaps others soon?

Computational Analysis:
Honavar, Wise, Dobbs


Proteome Analysis: “Traditionally”

using Two-dimensional (2D) gels


1st D: Isoelectric focusing (IEF) in pH gradient: Proteins migrate to isoelectric points & stop moving

2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis):

Proteins migrate according to molecular weight


Proteins identified on 2D gels (IEF/SDS-PAGE)

Direct protein microsequencing by Edman degradation
-- done at facilities (here at ISU)
-- typically need 5 picomoles
-- often get 10 to 20 amino acids of sequence

Protein mass analysis by MALDI-TOF
-- Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry
-- done at facilities (here at ISU)
-- often detect post-translational modifications (such as phosphorylated Ser, Thr, Tyr)

Page 250-1


Evaluation of 2D gels (IEF/SDS-PAGE)

Advantages:
• Visualize hundreds to thousands of proteins
• Improved identification of protein spots

Disadvantages:
• Limited number of samples can be processed
• Mostly abundant proteins visualized
• Technically difficult

Page 251, Jonathan Pevsner


Tandem Mass Spectrometry (MS/MS) to Identify Proteins


Figure 8.19 Tandem mass spectrometry for protein identification

a) ESI creates ionized proteins, represented by colored shapes with positive charges. Each shape represents many copies of identical proteins.

b) Ionized proteins are separated based on their mass to charge ratio (m/z) and sent one at a time into the activation chamber. Separation and selection take place in the first of the two MS devices. The solid purple protein has been selected for analysis; the other three are temporarily stored for later analysis.

c) The group of m/z selected ionized proteins enters a collision cell that is filled with inert argon gas. Gas molecules collide with proteins, which causes them to break into two peptide pieces (labeled b and y).

d) Ionized peptide pieces are sent into second MS device, which again measures the m/z ratio. A computer compares spectrum of peptide pieces to a database of ideal spectra to identify the original group of identical proteins.


MS data: Protein identification through peptide fragment identification & separation


Figure 8.20 When a group of identical proteins is broken into peptide pieces, more than one pair of b and y peptides will be formed. a) One protein sequence and its calculated mass on top, with the b peptides/masses (gray) and the y peptides/masses (purple) below. b) An experimentally determined mass/charge spectrum from the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were more abundant than others. The spectrum is used to determine each peptide’s amino acid sequence and protein identity.
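
The b and y peptide masses described in this caption can be computed with a short Python sketch. Several things here are assumptions for illustration: the fragments are treated as singly charged ions, standard monoisotopic residue masses are used, no modifications are considered, and the function name and the example peptide "PEPTIDE" are made up. A real identification pipeline compares many such theoretical spectra against the measured one.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of (fragment, m/z) for charge +1.

    Breaking the peptide after residue i gives one b fragment (N-terminal
    piece) and one complementary y fragment (C-terminal piece).
    """
    residues = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append((peptide[:i], sum(residues[:i]) + PROTON))
        y_ions.append((peptide[i:], sum(residues[i:]) + WATER + PROTON))
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("PEPTIDE")  # hypothetical example sequence
for (b_seq, b_mz), (y_seq, y_mz) in zip(b_ions, y_ions):
    print(f"b {b_seq:<7}{b_mz:10.4f}    y {y_seq:<7}{y_mz:10.4f}")
```

Each row pairs a b fragment with its complementary y fragment; their m/z values always sum to the intact peptide mass plus two protons, which is one way to sanity-check a spectrum.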


Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html

Jonathan Pevsner
