BIOL3014 Review Advanced Bioinformatics. Protein Structure.


Transcript of BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Page 1: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

BIOL3014 Review

Advanced Bioinformatics

Page 2: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Protein Structure

Page 3: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Proteins are linear polymers that fold up by themselves…mostly.

Page 4: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

The amino acids

They can be grouped in many ways according to the chemical and physical properties (e.g. size) of the side chain.

Here is one grouping based on chemical properties:

• Basic: proton acceptors

• Acidic: proton donors

• Uncharged polar: have polar groups like CONH2 or CH2OH

• Nonpolar: tend to be hydrophobic

• Weird: proline links to the N in the main chain

• Strong: cysteine can make “disulphide bridges”

Page 5: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Protein Secondary Structure

Page 6: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Alpha Helix

• 3.6 amino acid (residues) per turn

• O(i) hydrogen bonds to N(i+4)

From book…correct?

Wikipedia

Page 7: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Beta Sheet

A. Three strands shown

B. Anti-parallel sheet

C. Parallel sheet

Sheets are usually curved and can even form barrels.

Page 8: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Beta Turns: getting around tight corners

• Steric hindrance determines whether a tight turn is possible

• R3’s side chain is usually Hydrogen (R3 is glycine)

Page 9: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

X-ray crystallography

• Needs crystallized proteins

• Hard to get crystals

• Very tough for hydrophobic (e.g. transmembrane) proteins

• Better accuracy than NMR

• Expensive: $100,000/protein

Page 10: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

NMR spectroscopy

• Protons resonate at a frequency that depends on their chemical environment.

• This can be used to predict structure.

• Does not require crystallization; protein may be in solution.

• Lower resolution than X-ray crystallography

Page 11: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Obtaining secondary structure from sequence

Page 12: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Predict what?

• There are many types of secondary structure.

• Which do we want to predict?
– Alpha helix
– Beta strand
– Beta turn
– Random coil
– Pi-helices
– 3₁₀-helices
– Type I turns
– …

Page 13: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Start with some proteins of known structure

• Get some good X-ray or NMR models of proteins.

• Since we know their tertiary structures, certainly we can assign each residue in each protein a secondary state.

• Or can we?

Page 14: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

DSSP to the rescue!

• In 1983 Kabsch and Sander introduced DSSP (Dictionary of Protein Secondary Structure) …not a typo..

• It automated the assignment of secondary structure from tertiary structure to make it less arbitrary.

Page 15: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Rules

• Chou-Fasman: created tables of breaking/forming propensity and the relative frequency of each residue type in helices and strands.

• Self information (what the identity of a residue tells you about its likely secondary structure state) is not the only thing we can extract from the known structures.
– Maybe certain residues have a strong influence on (or are strongly correlated with) what the secondary state is several residues away. So, look at “long-distance” relationships:

• Directional information: information about the conformation at position i carried by the residue at position j, where i≠j, and is independent of the type of residue at position j.

• Pair information: like directional information, but takes account of the type of residue at position j.
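The propensity tables mentioned above can be illustrated with a small sketch. It assumes the usual definition of a propensity (how often a residue type is found in a helix, relative to how often any residue is found in a helix); the function name and the toy sequences are hypothetical, not taken from the course material.

```python
from collections import Counter

def helix_propensities(sequences, states):
    """Helix propensity per residue type from labelled examples.

    sequences: amino-acid strings; states: equal-length strings where 'H'
    marks a helical residue. Propensity = P(helix | residue) / P(helix).
    """
    total = Counter()
    in_helix = Counter()
    for seq, ss in zip(sequences, states):
        if len(seq) != len(ss):
            raise ValueError("sequence and state strings must have equal length")
        for residue, state in zip(seq, ss):
            total[residue] += 1
            if state == "H":
                in_helix[residue] += 1
    overall = sum(in_helix.values()) / sum(total.values())
    if overall == 0:
        raise ValueError("no helical residues in the training data")
    return {aa: (in_helix[aa] / total[aa]) / overall for aa in total}

# Toy data only: two made-up fragments with made-up state assignments.
toy = helix_propensities(["AEELLKG", "GPAEEL"], ["HHHHHCC", "CCHHHH"])
```

A value above 1 marks a residue type that is over-represented in helices in the training set (a "former"), and a value below 1 marks a "breaker".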

Page 16: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Don’t forget about evolution!

• Sequence evolves faster than structure.

• So, imagine a position in an alpha helix (or other conformation) that recently mutated.
– If we could find the orthologous residue in the same protein in other species, those residues would give us a much better picture.
– So, we should look at the distribution of residues at that position, not just the residue in a particular protein.

Page 17: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

PSI-BLAST is often used to get residue distributions

• The simplest way to get an estimate of the distribution of residues at each position in the protein we are trying to predict is to use PSI-BLAST.
– PSI-BLAST will output a “profile” containing an estimate of the residue distribution at each position in the query protein.
– Each column of the profile is a multinomial probability vector.

• The PSI-BLAST profile can be used in place of the protein in prediction rules.

• PSI-BLAST also outputs a multiple alignment, and it, too, can be used in prediction rules.
– You could predict the secondary structure for each protein in the alignment, and choose the “majority” or “average” prediction.
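As a rough illustration of the two ideas on this slide, the sketch below turns the columns of a toy multiple alignment into probability vectors and takes a per-position majority vote over several predictions. It uses plain residue counts; a real PSI-BLAST profile also applies sequence weighting and pseudocounts, which are omitted here. All names and data are hypothetical.

```python
from collections import Counter

def column_distributions(alignment):
    """One {residue: probability} dict per alignment column, ignoring gaps ('-')."""
    columns = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        n = sum(counts.values())
        columns.append({aa: k / n for aa, k in counts.items()} if n else {})
    return columns

def majority_prediction(predictions):
    """Per-position majority vote over equal-length secondary-structure strings.

    Ties are broken in favour of the state seen first in that column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*predictions))

# Toy alignment of three sequences and three per-sequence predictions.
profile = column_distributions(["ACD", "ACE", "A-E"])
consensus = majority_prediction(["HHC", "HEC", "HHE"])
```

Each dict in `profile` sums to 1, which is what makes a column usable as a multinomial probability vector.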

Page 18: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Why use HMMs for transmembrane topology?

• Transmembrane proteins have a simple, repetitive topology.

• The topology can be subdivided into a small set of regions.
– Helices
– Inside
– Outside
– Tails/Caps (at ends of helices)

• The helices tend to have lengths in a limited range.

Page 19: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

HMM design: Modeling sequences of varying lengths

• Self-loops can model sequences of length 1 to infinity: L = [1,…,infinity]

• Each time through the self-loop generates one more letter.

• This 1-state model generates sequences of length L with probability:

Pr(L) = p^(L-1) (1-p)

• So, you control the length of the sequences (sort of…).

(Figure: a single state with self-loop probability p and exit probability 1-p.)
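The length distribution of a single self-looping state is geometric, Pr(L) = p^(L-1) (1-p). The sketch below evaluates it and samples from it; the function names are illustrative only.

```python
import random

def length_probability(L, p):
    """Pr(L) = p^(L-1) * (1-p) for a state with self-loop probability p."""
    if L < 1:
        raise ValueError("a visit to the state emits at least one letter")
    return p ** (L - 1) * (1 - p)

def expected_length(p):
    """Mean of the geometric length distribution: 1 / (1 - p)."""
    return 1 / (1 - p)

def sample_length(p, rng):
    """Simulate the state: emit one letter, then loop again with probability p."""
    L = 1
    while rng.random() < p:
        L += 1
    return L

rng = random.Random(0)
lengths = [sample_length(0.9, rng) for _ in range(1000)]
```

With p = 0.9 the mean length is 10, but the most probable single length is still 1, which is why the control over length is only "sort of": a single state cannot produce a distribution peaked around a typical helix length.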

Page 20: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Grouping states

• To avoid over-fitting, we want to reduce the number of parameters.
– Each emitting state has nineteen free parameters (the number of amino acids minus one).

• If a group of states are modeling regions with very similar amino acid preferences, why not require that they all use the same parameters?
– If you tie n states together, they share one set of 19 parameters instead of 19n, so you “save” 19(n-1) parameters and the model is less prone to over-fitting when you train it.

Page 21: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Generalization

• We want to know how well a model will generalize to data it has never “seen”.

• If we test (measure accuracy) on the same data we trained on:
– We overestimate the generalization accuracy
– We will tend to over-fit the training data (by adjusting the model design to fit it)

Page 22: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Sample questions

1. Obtaining protein secondary structure
– a. Define the protein secondary structure task.
– b. List five types of secondary structure element.
– c. Describe what is meant by the ideas of “self information”, “directional information” and “pair information” when predicting secondary structure using a sliding-window method.
– d. What is a PSI-BLAST profile and why are they used in secondary structure prediction?
– e. What kinds of proteins are HMMs particularly suited to modeling?

Page 23: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Proteome and Gene Expression Analysis

Page 24: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

The Goals

• Functional Genomics:
– To know when, where and how much genes are expressed.
– To know when, where, what kind and how much of each protein is present.

• Systems Biology:
– To understand the transcriptional and translational regulation of RNA and proteins in the cell.

Page 25: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Measuring Gene Expression

• What we want to do is measure the number of copies of each RNA transcript in a cell at a given point in time.
– Extract the RNA from the cell.
– Measure each type of transcript quantitatively.

• How do you measure it?
– Sequence it in a quantitative way
– But sequencing is (used to be) very expensive

• So, use technology and tricks…

Page 26: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Low-throughput Sequencing

• qPCR (quantitative PCR, also called real-time PCR) allows you to accurately measure a given transcript.
– But you have to decide which transcript you want to measure and make primers for it.
– So it is very expensive and low-throughput.

• So the “array technologies” were born…

Page 27: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Gene Arrays

• Put a bunch of different, short single-stranded DNA sequences at predefined positions on a substrate.

• Let the unknown mixture of tagged DNA or RNA molecules hybridize to the DNAs.

• Measure the amount of hybridized material.

Page 28: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Measuring Protein Expression

• In order to measure all the types of protein in a cell we must
– Extract the proteins
– Purify the proteins
– Identify the individual proteins

• How do we accomplish purification and identification of proteins?

Page 29: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

The Technologies: Protein Expression

• Low-throughput
– 2D Gel Electrophoresis + Mass Spectrometry
– Liquid Chromatography + Mass Spectrometry

• Protein microarrays
– Limited in application at this point
– Can be used for things other than protein expression, like protein-protein interactions

Page 30: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Separating the Proteins: 2D Gel Electrophoresis

• First step: pI/pH
– Proteins are introduced to a gel with an immobilized pH gradient.
– An electric field is applied.
– Proteins migrate until the pH causes them to lose their net charge (the isoelectric point) and then stop.

• Second step: mass
– The first gel is transferred to a second gel.
– SDS (a detergent) breaks up the protein structure and gives each protein a charge roughly proportional to its mass, so the proteins separate by size.


Page 31: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Steps of Mass Spectrometry

• Digest:
– Sample (spot) is digested with a proteolytic enzyme

• Spectrum:
– Peaks correspond to the mass-to-charge ratio of protein fragments
– These provide a fingerprint

• Identify:
– Compare the fingerprint to theoretical fingerprints
– Post-translational modifications shift fragment masses, which complicates the matching.

Page 32: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Goals

• We’ve measured the expression of genes or proteins using the technologies discussed previously.

• What can we do with that information?
– Identify significant differences in expression
– Identify similar patterns of expression (clustering)

Page 33: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Analysis steps

1. Data normalization

2. Statistical Analysis

3. Cluster Analysis

Page 34: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Data Normalization

• Why normalize?
– Removes systematic errors
– Makes the data easier to analyze statistically

Page 35: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Sources of Error

• Measurements always contain errors.
– Systematic (oops)
– Random (noise!)

• Subtracting the background level can remove some systematic error
– Using the ratio in two-channel experiments does this
– Subtracting the overall average intensity can be used with one-channel data.

• Taking averages over replicates of the experiment reduces the random error.
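The three operations on this slide can be sketched as follows. This is a minimal illustration, not a full normalization pipeline: the use of log base 2, the function names and the example intensities are all assumptions.

```python
import math
from statistics import mean

def log_ratio(red, green, red_bg=0.0, green_bg=0.0):
    """Background-subtracted log2 ratio for one spot of a two-channel array."""
    r, g = red - red_bg, green - green_bg
    if r <= 0 or g <= 0:
        raise ValueError("signal must exceed background in both channels")
    return math.log2(r / g)

def center(values):
    """Subtract the overall average (one-channel data, typically log intensities)."""
    m = mean(values)
    return [v - m for v in values]

def replicate_mean(replicates):
    """Average each spot over replicate experiments to reduce random error.

    replicates: list of equal-length lists, one list per replicate.
    """
    return [mean(spot) for spot in zip(*replicates)]

# Hypothetical intensities: foreground 500 vs 150, background 100 vs 50.
ratio = log_ratio(500, 150, red_bg=100, green_bg=50)
```

Averaging n independent replicates reduces the standard deviation of the random error by a factor of √n, which is also why n appears in the Z-score.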

Page 36: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Statistical Analysis

• Determining what differences in expression are statistically significant

• Controlling false positives

Page 37: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

When are two measurements significantly different?

• We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1).

• A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.

• The significance is related to the area of the overlap of the underlying distributions.


Page 38: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

The Z-test

• If the data is approximately normal, convert it to a Z-score: Z = (X̄ − μ) / (σ / √n)
– X̄ is the mean of the measurements X; X can be the log expression ratio, and μ is then 0
– σ is the sample standard deviation
– n is the number of repeats

• The Z-score is distributed N(0,1) (standard normal).

• The significance level is the area in the tail(s) of the standard normal distribution.
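A minimal sketch of the Z-test for replicate log expression ratios, assuming the usual form Z = (mean − μ) / (σ / √n) with μ = 0 under the hypothesis of no change. The function name and the example values are hypothetical; `NormalDist` comes from the Python standard library (3.8 or later).

```python
import math
from statistics import NormalDist, mean, stdev

def z_test(values, mu=0.0):
    """Return (Z, two-sided p-value) for the mean of `values` against `mu`."""
    n = len(values)
    z = (mean(values) - mu) / (stdev(values) / math.sqrt(n))
    # Two-sided significance: area in both tails of the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical log2 ratios from five repeats of one spot.
z, p_value = z_test([1.0, 1.2, 0.8, 1.1, 0.9])
```

With few repeats the sample standard deviation is itself uncertain, so the normal approximation is optimistic; that is one motivation for the t-test.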


Page 39: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

The t-test

• The t-test makes fewer assumptions about the data than the Z-test

• It can be applied to compare two average measurements which can have
– Different variances
– Different numbers of observations
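The variant of the t-test that allows different variances and different numbers of observations is commonly called Welch's t-test. The sketch below computes its statistic and approximate degrees of freedom with the standard library only; turning these into a p-value needs the t distribution, which is not in the standard library, so that step is left out. Names and data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for two samples with unequal variances."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical expression measurements under two conditions.
t, df = welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```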

Page 40: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Cluster Analysis

• Similar expression patterns
– Groups of genes/proteins with similar expression profiles

• Similar expression sub-patterns
– Groups of genes/proteins with similar expression profiles in a subset of conditions

Page 41: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Distance Measures Between Pairs of Points

• In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other.

• So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix.

• We can then compute all the pair-wise distances between rows (or columns).

Page 42: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Standard Distance Measures

• Euclidean Distance

• Pearson Correlation Coefficient

• Mahalanobis Distance

Page 43: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Euclidean Distance

• Standard, everyday distance
– Treats all dimensions equally
– If some genes vary more than others (have higher variance), they influence the distance more.

Page 44: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Mahalanobis Distance

• The “normalized” Euclidean distance

• Scales each dimension by the variance in that dimension (the general form uses the full covariance matrix).
– This is useful if the genes tend to vary much more in one sample than in others since it reduces the effect of that sample on the distances.
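The variance scaling described here can be sketched as below. This is the diagonal special case only (each dimension scaled by its own variance); the general Mahalanobis distance uses the inverse of the full covariance matrix, which is not shown. Function names and the toy matrix are assumptions.

```python
import math
from statistics import variance

def column_variances(matrix):
    """Sample variance of each column (here: each sample/condition)."""
    return [variance(column) for column in zip(*matrix)]

def scaled_euclidean(x, y, variances):
    """Euclidean distance with each squared difference divided by that dimension's variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Toy matrix: rows are genes, columns are samples; the second sample varies far more.
genes = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
var = column_variances(genes)
d = scaled_euclidean(genes[0], genes[2], var)
```

Without scaling, the second sample contributes 1600 of the 1604 units of squared distance between the first and last gene; after scaling, both samples contribute equally.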

Page 45: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Pearson Correlation Coefficient

• Distances are small when two genes have similar patterns of change even if the sizes of the changes are different.

• This is accomplished by centering each gene’s expression levels on their mean and scaling by their sample standard deviation across the different conditions.
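The contrast between Euclidean distance and a correlation-based distance can be seen in a short sketch. Using 1 − r as the distance is one common convention and is an assumption here, as are the function names and the toy profiles.

```python
import math
from statistics import mean

def euclidean(x, y):
    """Ordinary Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    """Correlation-based distance: 0 for identical patterns, 2 for opposite ones."""
    return 1 - pearson(x, y)

# Two toy genes with the same pattern of change at very different magnitudes.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [10.0, 20.0, 30.0, 40.0]
```

Here `euclidean(gene1, gene2)` is large (about 49) while `pearson_distance(gene1, gene2)` is essentially zero, so only the correlation-based measure would cluster the two genes together.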

Page 46: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Types of Linkage

• A. Single Linkage

• B. Complete Linkage

• C. Centroid Method
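The three linkage types differ only in how the distance between two clusters is derived from distances between points. The sketch below uses the standard definitions (closest pair, farthest pair, distance between centroids) with Euclidean distance; the names and the toy clusters are illustrative.

```python
import math
from itertools import product

def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(A, B):
    """Distance between the closest pair of points, one from each cluster."""
    return min(distance(p, q) for p, q in product(A, B))

def complete_linkage(A, B):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(distance(p, q) for p, q in product(A, B))

def centroid(points):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def centroid_linkage(A, B):
    """Distance between the centroids of the two clusters."""
    return distance(centroid(A), centroid(B))

# Two toy clusters in the plane.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 2.0)]
```

For these clusters the single, centroid and complete linkage distances are 3, 4 and √29 respectively, so the choice of linkage can change which clusters are merged first.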


Page 47: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Sample Questions

1. Gene expression analysis

– a. What kind of molecules do expression microarrays measure?

– b. Expression microarray data is known to be “noisy”. Describe as many ways as you can of reducing this problem.

– c. What experimental technique is commonly used to validate the results of expression microarrays?

– d. The “Z-test” or “t-test” is usually applied to expression microarray data. Why is this done and what do these tests tell us?

– e. Principal component analysis is often applied to microarray data as well. What is its purpose and what can it tell us?

– f. Name two types of distance measures that can be used with microarray data for clustering expression profiles.

Page 48: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Phylogeny

Page 49: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Overview

• Evolution and sequence variation

• Phylogenetic trees
– The meaning of distance
– Evolutionary sequence models

• Constructing trees

Page 50: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Rooted and Unrooted Trees

• “Leaves” are extant species

• Internal nodes are ancestral species

• Adding a root gives time a direction

Page 51: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

But how can two species be at different “evolutionary distances” from their ancestor?

?

Page 52: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Distance ≠ Time

• The rate of evolution, r, can vary over time.

• The distance is equal to the rate times the time:

d=rt

Page 53: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

What is the evolutionary distance between two DNA sequences?

• Align the two DNA sequences.

• Count the number of places where they differ (ignoring gaps)

p = D/L
– D is the number of differences and
– L is the total number of aligned positions
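A minimal sketch of the p-distance computation for two already-aligned sequences. It assumes that "ignoring gaps" means columns with a gap in either sequence are left out of both D and L; the function name and example are hypothetical.

```python
def p_distance(seq1, seq2):
    """Fraction of ungapped aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned positions")
    differences = sum(a != b for a, b in pairs)  # D
    return differences / len(pairs)              # D / L

# Toy alignment: five ungapped columns, one mismatch (G vs C).
p = p_distance("ACGT-A", "ACCTTA")
```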

Page 54: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Is p the evolutionary distance?

• NO!

• p is just the observed fraction of differences.
– What value will p tend towards as evolutionary distance increases?

Page 55: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Relationship between p-distance and d-distance

• So the branch lengths of the tree are “d=rt”.

• We must propose an evolutionary model to compute “d” from the observed p-distance.

Page 56: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Jukes-Cantor Evolutionary Model

• Assumes all base frequencies are ¼

• Has one parameter, α, the substitution rate (per unit time).

• Distance formula: d = -¾ ln(1 - 4⁄3 p)
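The Jukes-Cantor correction can be sketched as below, written with the conventional leading minus sign, d = -¾ ln(1 - 4⁄3 p), so that d is positive (the logarithm of a number below 1 is negative). The function name is illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) from the observed p-distance."""
    if not 0 <= p < 0.75:
        # At p >= 3/4 the sequences look random and the distance is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.01, 0.1, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

For small p the corrected distance is close to p, and it grows without bound as p approaches ¾, the fraction of differences expected between two unrelated sequences with equal base frequencies.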

Page 57: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Transitions and Transversions

Page 58: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Kimura Two-Parameter Model

• Models transitions and transversions separately because transversions are observed less often than transitions.
– Transitions: A<->G, C<->T
– Transversions: purine<->pyrimidine (A<->C, A<->T, G<->C, G<->T)
– Two parameters: transition rate α, transversion rate β.

• Distance formula:

d = -½ ln(1-2P-Q) - ¼ ln(1-2Q), where P and Q are the observed fractions of sites that differ by a transition and by a transversion, respectively.
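A sketch of the Kimura two-parameter distance, using the form d = -½ ln(1-2P-Q) - ¼ ln(1-2Q) with leading minus signs so that d is positive. The helper that classifies differences assumes ungapped DNA sequences over A, C, G, T; all names and the toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites differing by a transition / a transversion."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class (purine-purine or pyrimidine-pyrimidine) is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n

def kimura_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) fractions."""
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

# Toy pair: one transition (A->G) and one transversion (A->C) in ten sites.
P, Q = transition_transversion_fractions("AAAAAAAAAA", "GAAAAAAAAC")
d = kimura_distance(P, Q)
```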

Page 59: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Constructing Phylogenetic Trees

Page 60: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

First, construct a multiple alignment

• A good multiple alignment is key.

• The p-distances between pairs of sequences can then be computed.

• This allows the d-distances between pairs of sequences to be computed.

• Some tree-building methods use the multiple alignment directly
– Parsimony Methods

Page 61: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Tree-building methods

• UPGMA (1958)
– Builds rooted, ultrametric trees
– Assumes constant rate of evolution in all branches

• Neighbor-joining (1987)
– Builds unrooted, additive trees
– Assumes the best tree has the shortest total branch length.
– Principle of minimum evolution, as with maximum parsimony trees.
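To make the UPGMA idea concrete, here is a compact sketch: repeatedly merge the two closest clusters, place the new node at half their distance, and update distances as size-weighted averages. It is a teaching sketch under simplifying assumptions (no tie handling, trees returned as nested tuples, node heights instead of explicit branch lengths), and the names and toy distances are hypothetical.

```python
from itertools import combinations

def upgma(labels, distances):
    """Build a rooted ultrametric tree.

    labels: leaf names; distances: {(a, b): d} for every pair of leaves.
    Returns (tree, heights): tree is nested tuples, heights maps each
    internal node to its height above the leaves.
    """
    sizes = {label: 1 for label in labels}  # cluster -> number of leaves
    d = {frozenset(pair): dist for pair, dist in distances.items()}
    heights = {}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        heights[merged] = d[frozenset((a, b))] / 2
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to every remaining cluster: average weighted by cluster size.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    (tree,) = sizes
    return tree, heights

# Toy distances that fit a molecular clock exactly.
tree, heights = upgma(["A", "B", "C"], {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0})
```

In the toy example A and B join at height 1 and the root sits at height 3, so every leaf is the same distance from the root, which is what "ultrametric" means.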

Page 62: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Ultrametric Trees

• Simplest type of rooted, additive tree.

• Assumes that the rate of evolution is constant over time.
– With sequences, called the “molecular clock”.
– Horizontal lines have no meaning.

Page 63: BIOL3014 Review Advanced Bioinformatics. Protein Structure.

Example Questions

1. Phylogeny
– a. What does adding a root to a tree tell us?
– b. In an additive tree, what is the meaning of branch length?
– c. What is the difference between p-distance and d-distance?
– d. What is the difference between the Jukes-Cantor evolutionary model and the Kimura two-parameter model?
– e. What does the neighbor-joining tree building method minimize?
– f. What kind of trees does it build?